RAG Types, Taxonomy & Advanced Patterns
← Back to Overview: RAG
A complete reference: the RAG evolution ladder from Naive to GraphRAG, deep-dive implementations of each pattern, the extended variant taxonomy, and the comparison matrix.
RAG Evolution Overview
Every RAG system sits on this spectrum. Understanding where a system sits is the first question in any architecture discussion.
Naive RAG
↓ add pre/post-retrieval steps
Advanced RAG
↓ decompose into swappable modules
Modular RAG
↓ add autonomous retrieval decisions
Agentic RAG
↓ replace vector index with knowledge graph
GraphRAG
↓ retrieve across image, text, audio
Multimodal RAG
Naive RAG
Concept
Naive RAG is the original formulation: index documents once, embed a query at runtime, retrieve top-k chunks, stuff them into a prompt, generate an answer. No preprocessing of the query, no post-processing of results.
Pipeline:
Documents → chunk → embed → store in vector DB
Query → embed → similarity search → top-k chunks → prompt → LLM → answer
What works: Simple, fast, cheap to build. Handles factual recall well when the corpus is clean and the query is specific.
What breaks: - Poorly phrased queries retrieve irrelevant chunks - Chunks may contain the right information but wrong context (mid-sentence splits) - No handling of complex multi-hop questions - No feedback loop — errors are silent - Context window stuffing: top-k chunks may exceed the LLM's effective attention window
When to use: Prototypes, internal tools with small clean corpora (<10K docs), or when latency budget is very tight and precision requirements are moderate.
Code
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
vectorstore = Chroma.from_texts(texts=documents, embedding=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
prompt = ChatPromptTemplate.from_template(
"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)
llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash")
naive_rag = (
{"context": retriever, "question": RunnablePassthrough()}
| prompt | llm | StrOutputParser()
)
answer = naive_rag.invoke("What is the refund policy?")
Advanced RAG
Concept
Advanced RAG adds intelligence at three stages: pre-retrieval, retrieval, and post-retrieval. It fixes Naive RAG's core weaknesses while keeping a single retrieval step.
Pre-retrieval improvements: - Query rewriting — rephrase the query to be more retrieval-friendly - HyDE — generate a hypothetical answer, embed that instead of the raw query - Step-back prompting — generalize a specific question to retrieve broader context first - Multi-query — generate 3–5 query variants, retrieve for each, deduplicate
Retrieval improvements: - Hybrid search — combine dense (semantic) + sparse (BM25) retrieval - Metadata filtering — pre-filter by date, source, category before ANN search - MMR — penalize redundant chunks, increase diversity
Post-retrieval improvements: - Re-ranking — cross-encoder model scores (query, chunk) pairs, promotes best chunks - Contextual compression — extract only the relevant sentences from each retrieved chunk - Chunk stitching — restore surrounding sentences for mid-sentence chunks
When to use: Production systems with real users. The combination of query rewriting + hybrid search + reranking often improves answer quality 20–40% over Naive RAG with modest added latency.
Code
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.retrievers.multi_query import MultiQueryRetriever
# Hybrid retriever (dense + sparse)
bm25 = BM25Retriever.from_texts(documents)
dense = vectorstore.as_retriever(search_kwargs={"k": 6})
hybrid = EnsembleRetriever(retrievers=[bm25, dense], weights=[0.4, 0.6])
# Post-retrieval compression
compressor = LLMChainExtractor.from_llm(llm)
advanced_retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=hybrid
)
# Pre-retrieval: query expansion
multi_query_retriever = MultiQueryRetriever.from_llm(
retriever=advanced_retriever, llm=llm
)
Modular RAG
Concept
Modular RAG (Gao et al., 2023) treats each RAG component as a swappable module. Instead of a fixed pipeline, you compose modules depending on the task:
| Module Type | Options |
|---|---|
| Search | Vector DB, BM25, Google Search, SQL, Knowledge Graph |
| Memory | Short-term (conversation), Long-term (episodic store) |
| Fusion | RRF, linear interpolation, learned weights |
| Routing | Classifier, LLM-based, rule-based |
| Predict | Standard generation, chain-of-thought, self-consistency |
| Task Adaption | Domain fine-tuned retriever, task-specific prompt |
Key insight: Different queries need different retrieval modules. A question about recent news needs web search. A question about internal policy needs a vector DB. A question about product relationships needs a knowledge graph. Modular RAG routes dynamically. This is the architectural parent of Agentic RAG.
Query
↓
Router (classify intent)
├─ "factual/internal" → Vector DB retriever → context → LLM
├─ "recent events" → Web search → snippet → LLM
├─ "structured data" → SQL agent → table → LLM
└─ "relationship" → Graph traversal → paths → LLM
Code
from langchain_core.runnables import RunnableBranch
def route_query(query: str) -> str:
routing_prompt = f"""Classify this query into one of: internal, web_search, sql, graph.
Query: {query}
Classification:"""
return llm.invoke(routing_prompt).content.strip().lower()
pipeline = RunnableBranch(
(lambda x: route_query(x["query"]) == "internal", internal_rag_chain),
(lambda x: route_query(x["query"]) == "web_search", web_search_chain),
(lambda x: route_query(x["query"]) == "sql", sql_agent_chain),
fallback_chain,
)
Multi-Query RAG
Concept
Multi-Query RAG addresses a fundamental limitation of single-query retrieval: a single embedding of the user's question represents one point in vector space, and relevant documents spread across multiple semantic dimensions may not be captured by that single point.
The pattern: use an LLM to generate 3–5 paraphrased or decomposed variants of the original query, retrieve separately for each variant, then deduplicate and merge the results.
When it helps: - Ambiguous queries with multiple valid interpretations - Complex questions that decompose into sub-questions - Queries using informal language when documents use formal terminology
When it doesn't help: - Precise technical queries (e.g., "error code E4013") — variants can't improve on the exact term - Adds latency: N queries × embedding time + N × ANN searches + deduplication
Deduplication strategy: De-duplicate by content hash before reranking. If two variants retrieve the same chunk, keep it once but boost its effective score — found by multiple independent queries signals higher relevance.
Code
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import BaseOutputParser
class LineListParser(BaseOutputParser):
def parse(self, text: str) -> list[str]:
return [line.strip() for line in text.strip().split("\n") if line.strip()]
query_gen_prompt = ChatPromptTemplate.from_template(
"""Generate {n} different versions of this question to improve document retrieval.
Each version should approach the question from a different angle.
Original question: {question}
Output one question per line, no numbering."""
)
def multi_query_retrieve(query: str, n: int = 3) -> list:
variants = (query_gen_prompt | llm | LineListParser()).invoke(
{"question": query, "n": n}
)
all_docs = {}
for variant in [query] + variants:
docs = vectorstore.similarity_search(variant, k=6)
for doc in docs:
doc_id = hash(doc.page_content)
if doc_id not in all_docs:
all_docs[doc_id] = (doc, 1)
else:
all_docs[doc_id] = (doc, all_docs[doc_id][1] + 1)
# Docs found by multiple queries rank higher
return [doc for doc, _ in sorted(all_docs.values(), key=lambda x: -x[1])]
Contextual Compression
Concept
Standard retrieval returns full document chunks regardless of whether every sentence is relevant to the query. Contextual compression post-processes each retrieved chunk to extract only the sentences relevant to the query.
The problem it solves: A 500-character chunk about "refund policy" might contain 4 sentences: one about refunds, one about exchanges, one about store credit, one about contact info. A query about refunds only needs the first sentence. Returning all 4 wastes context window space and can confuse the LLM.
Two types of compressors:
| Compressor | How | Cost | Quality |
|---|---|---|---|
LLMChainExtractor |
LLM extracts relevant sentences | High (LLM call per chunk) | Excellent |
LLMChainFilter |
LLM decides keep/discard per chunk | Medium | Good |
EmbeddingsFilter |
Embedding similarity between chunk and query | Very Low | Moderate |
Practical choice: EmbeddingsFilter with threshold=0.7 is fast and free; LLMChainExtractor is more precise but costs an extra LLM call per retrieved chunk.
Code
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import (
LLMChainExtractor, EmbeddingsFilter, DocumentCompressorPipeline
)
# Option 1: LLM extraction (best quality)
extractor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
base_compressor=extractor,
base_retriever=vectorstore.as_retriever(search_kwargs={"k": 8})
)
# Option 2: Embedding filter (fast, free)
embeddings_filter = EmbeddingsFilter(
embeddings=embeddings,
similarity_threshold=0.70
)
fast_retriever = ContextualCompressionRetriever(
base_compressor=embeddings_filter,
base_retriever=vectorstore.as_retriever(search_kwargs={"k": 8})
)
# Option 3: Pipeline — filter first (cheap), then extract from remaining
pipeline_compressor = DocumentCompressorPipeline(
transformers=[embeddings_filter, extractor]
)
pipeline_retriever = ContextualCompressionRetriever(
base_compressor=pipeline_compressor,
base_retriever=vectorstore.as_retriever(search_kwargs={"k": 10})
)
Self-RAG
Concept
Self-RAG (Asai et al., 2023) teaches an LLM to adaptively retrieve and critique by generating special reflection tokens alongside regular text. These tokens control the retrieval process.
Reflection tokens:
| Token | Meaning |
|---|---|
| [Retrieve] | Should I retrieve documents for this query? |
| [IsRel] / [IsIRel] | Is the retrieved document relevant? |
| [IsSup] / [IsNotSup] | Does the document support the generated statement? |
| [IsUse] / [IsNotUse] | Is the response useful to the user? |
How it works (inference):
1. LLM generates text until it produces [Retrieve] token
2. If [Retrieve] → fetch documents from retriever
3. LLM produces [IsRel] to evaluate each retrieved doc
4. LLM generates answer segments, producing [IsSup] to verify each claim
5. If [IsNotSup] → retrieve again with a refined query
Key insight: Unlike standard RAG (always retrieves once), Self-RAG retrieves only when needed and performs in-generation verification — reducing unnecessary retrievals for questions the model can answer from parametric knowledge.
Code
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
def self_rag(query: str) -> dict:
# Step 1: Assess if retrieval is needed
retrieval_decision_prompt = ChatPromptTemplate.from_template(
"""Should I retrieve documents to answer this question?
If the answer requires specific facts, recent information, or private knowledge → RETRIEVE
If it's general knowledge or reasoning → NO_RETRIEVE
Question: {query}
Decision (RETRIEVE or NO_RETRIEVE):"""
)
decision = (retrieval_decision_prompt | llm | StrOutputParser()).invoke({"query": query}).strip()
if "NO_RETRIEVE" in decision:
return {"answer": llm.invoke(query).content, "retrieved": False}
docs = retriever.invoke(query)
context = "\n\n".join(doc.page_content for doc in docs)
# Step 2: Assess document relevance
relevance_prompt = ChatPromptTemplate.from_template(
"""Is this context relevant to answering the question?
Question: {query}
Context: {context}
Answer (RELEVANT or IRRELEVANT):"""
)
relevance = (relevance_prompt | llm | StrOutputParser()).invoke(
{"query": query, "context": context[:500]}
).strip()
if "IRRELEVANT" in relevance:
docs = retriever.invoke(f"detailed information about {query}")
context = "\n\n".join(doc.page_content for doc in docs)
# Step 3: Generate with grounding assessment
grounded_gen_prompt = ChatPromptTemplate.from_template(
"""Answer using only the context. After the answer, rate:
FULLY_SUPPORTED, PARTIALLY_SUPPORTED, or NOT_SUPPORTED
Context: {context}
Question: {query}
Answer:
Support level:"""
)
response = (grounded_gen_prompt | llm | StrOutputParser()).invoke(
{"context": context, "query": query}
)
parts = response.split("Support level:")
return {
"answer": parts[0].replace("Answer:", "").strip(),
"support_level": parts[1].strip() if len(parts) > 1 else "unknown",
"retrieved": True,
}
FLARE (Forward-Looking Active REtrieval)
Concept
FLARE (Jiang et al., 2023) addresses a different problem from Self-RAG: during long-form generation, the model may need to retrieve additional information mid-generation to continue accurately.
How it works: 1. Model starts generating a response 2. At each generation step, it produces a "tentative" next sentence with token probabilities 3. If any word has probability < threshold (e.g., 0.2), the model is uncertain 4. Trigger retrieval on the uncertain span 5. Continue generation with the newly retrieved context
When FLARE excels: - Long-form generation (essays, reports) where the model needs fresh context mid-generation - Multi-part questions where each sub-part needs different retrieval
Key distinction from Self-RAG: Self-RAG retrieves at designated decision points; FLARE retrieves on-demand whenever generation uncertainty rises.
Code
def flare_generate(query: str, max_iterations: int = 5) -> str:
generated_text = ""
retrieval_contexts = []
for iteration in range(max_iterations):
prompt = f"""Context: {chr(10).join(retrieval_contexts)}
Previous text: {generated_text}
Continue the response for: {query}
If you're uncertain about any fact, prefix that sentence with [UNCERTAIN]:"""
next_segment = llm.invoke(prompt).content
if "[UNCERTAIN]" in next_segment:
uncertain_part = next_segment.split("[UNCERTAIN]")[1].split(".")[0]
new_docs = retriever.invoke(uncertain_part)
retrieval_contexts.extend([doc.page_content for doc in new_docs[:2]])
continue
generated_text += " " + next_segment
if next_segment.strip().endswith((".", "!", "?")):
break
return generated_text.strip()
Agentic RAG
Concept
Agentic RAG gives an LLM-based agent autonomy to decide when to retrieve, what to retrieve, and whether to retrieve again after seeing an initial result. This is a fundamental shift from pipeline RAG (fixed sequence) to agent RAG (dynamic control flow).
Differences from pipeline RAG:
| Dimension | Pipeline RAG | Agentic RAG |
|---|---|---|
| Retrieval timing | Always, at start | Agent decides |
| Number of retrievals | One | One to many |
| Fallback behavior | None | Agent retries with different query |
| Multi-hop | Not native | Natural via tool calls |
| Latency | Predictable | Variable |
| Complexity | Low | High |
Patterns within Agentic RAG: - ReAct — interleave reasoning and retrieval actions - Plan-and-solve — decompose into sub-questions, retrieve for each - Self-RAG — retrieval triggered by reflection tokens (see above) - CRAG — evaluate retrieval quality, fall back to web search if poor (see below)
ReAct trace:
Thought: I need to find Q3 revenue
Action: retrieve("Q3 2024 revenue")
Observation: [retrieved chunks — Q3 = $4.2B]
Thought: I found Q3, now I need Q4 and the reason for the change
Action: retrieve("Q4 2024 revenue cause of change")
Observation: [retrieved chunks]
Thought: I now have enough information
Action: generate_final_answer
When to use: Multi-hop questions, tasks requiring synthesis across multiple documents, when the retrieval plan cannot be determined statically.
Code
from langgraph.graph import StateGraph, END
from typing import TypedDict, Optional
class AgentState(TypedDict):
query: str
retrieved_docs: list[str]
answer: Optional[str]
needs_more_retrieval: bool
def retrieve_node(state: AgentState) -> AgentState:
docs = retriever.invoke(state["query"])
return {"retrieved_docs": [d.page_content for d in docs]}
def reason_node(state: AgentState) -> AgentState:
context = "\n".join(state["retrieved_docs"])
response = llm.invoke(
f"Context: {context}\nQuery: {state['query']}\n"
f"If you can answer confidently, start with ANSWER:. "
f"If you need more info, start with RETRIEVE_MORE: and give a follow-up query."
).content
if response.startswith("ANSWER:"):
return {"answer": response[7:].strip(), "needs_more_retrieval": False}
new_query = response.replace("RETRIEVE_MORE:", "").strip()
return {"query": new_query, "needs_more_retrieval": True}
builder = StateGraph(AgentState)
builder.add_node("retrieve", retrieve_node)
builder.add_node("reason", reason_node)
builder.add_edge("retrieve", "reason")
builder.add_conditional_edges("reason",
lambda s: "retrieve" if s["needs_more_retrieval"] else END,
{"retrieve": "retrieve", END: END}
)
builder.set_entry_point("retrieve")
graph = builder.compile()
# With LangChain tools API
from langchain.tools.retriever import create_retriever_tool
from langchain.agents import AgentExecutor, create_react_agent
retriever_tool = create_retriever_tool(
retriever=hybrid_retriever,
name="knowledge_search",
description="Search the knowledge base for relevant documents."
)
agent = create_react_agent(llm=llm, tools=[retriever_tool], prompt=react_prompt)
agent_executor = AgentExecutor(agent=agent, tools=[retriever_tool], max_iterations=5)
GraphRAG
Concept
GraphRAG (Microsoft Research, 2024) replaces or supplements the flat vector index with a knowledge graph. For questions that span many documents or require reasoning about relationships between entities, a graph index captures structure that a vector index cannot.
Build pipeline:
Document corpus
↓ LLM extracts entities and relationships from each chunk
Entity-Relationship triples: (Entity_A, relation, Entity_B)
↓ Build knowledge graph (nodes=entities, edges=relations)
↓ Run community detection (Leiden algorithm)
Communities (groups of related entities)
↓ LLM generates summary for each community at multiple levels
Hierarchical community summaries (C0 = coarse, C4 = fine-grained)
Two retrieval modes: - Local search — relevant entities, relationships, and community summaries for a specific query - Global search — broad community-level summaries to answer thematic questions
Why GraphRAG beats dense for relationship queries:
Example: "Which employees worked at both Google and Apple?" - Dense RAG: retrieves chunks mentioning Google or Apple — can't traverse the "employee of" relationship across documents - GraphRAG: entity "Alice Chen" is a node; edges "worked_at" connect to both "Google" and "Apple"; a graph traversal directly answers this
Cost reality: Indexing costs ~$1–5 per 1M tokens (LLM calls for entity extraction). Justified only when relationship traversal or global synthesis is a core query pattern.
Code
# Using the official Microsoft graphrag library
# pip install graphrag
import asyncio
from graphrag.query.cli import run_local_search, run_global_search
# Local search (specific entity questions)
result = asyncio.run(run_local_search(
config_dir="./my_corpus",
query="What products did Tesla announce in Q3 2024?"
))
# Global search (broad thematic questions)
result = asyncio.run(run_global_search(
config_dir="./my_corpus",
query="What are the recurring safety concerns mentioned across all reports?"
))
# DIY with NetworkX (smaller corpora)
import networkx as nx
def build_knowledge_graph(chunks: list[str], llm) -> nx.DiGraph:
G = nx.DiGraph()
for chunk in chunks:
triples_text = llm.invoke(
f"Extract entity-relationship triples. Format: entity1 | relation | entity2 (one per line)\nText: {chunk}"
).content
for line in triples_text.strip().split("\n"):
parts = [p.strip() for p in line.split("|")]
if len(parts) == 3:
G.add_edge(parts[0], parts[2], relation=parts[1])
return G
def graph_retrieve(G: nx.DiGraph, query: str, k_hops: int = 2) -> list[str]:
entities = llm.invoke(f"List named entities in: '{query}'. One per line.").content.split("\n")
relevant_nodes = set()
for entity in entities:
if entity.strip() in G:
neighborhood = nx.ego_graph(G, entity.strip(), radius=k_hops)
relevant_nodes.update(neighborhood.nodes())
return [
f"{u} {data['relation']} {v}"
for u, v, data in G.edges(data=True)
if u in relevant_nodes or v in relevant_nodes
]
Corrective RAG (CRAG)
Concept
CRAG (Yan et al., 2024) adds a quality gate after retrieval: evaluate whether the retrieved documents are actually relevant to the query. If not, fall back to web search instead of forcing the generator to work with irrelevant context.
The problem it solves: Standard RAG always uses retrieved documents, even if they're irrelevant. The LLM then hallucinates or says "I don't know" — wasting the query. CRAG detects retrieval failure and has a recovery strategy.
CRAG states:
- CORRECT — retrieved docs are highly relevant → use directly
- INCORRECT — retrieved docs are irrelevant → fall back to web search
- AMBIGUOUS — partial relevance → use retrieved docs + web search, merge
Code
from langgraph.graph import StateGraph, END
from typing import TypedDict, Optional
class CRAGState(TypedDict):
query: str
retrieved_docs: list[str]
retrieval_grade: str
web_results: list[str]
final_context: list[str]
answer: Optional[str]
def grade_retrieval(state: CRAGState) -> CRAGState:
grade = llm.invoke(
f"""Are these documents relevant to the query?
Query: {state['query']}
Documents: {state['retrieved_docs'][:2]}
Answer: CORRECT, INCORRECT, or AMBIGUOUS"""
).content.strip()
return {"retrieval_grade": grade}
def web_search_node(state: CRAGState) -> CRAGState:
from langchain_community.tools import TavilySearchResults
results = TavilySearchResults(k=3).invoke(state["query"])
return {"web_results": [r["content"] for r in results]}
def assemble_context(state: CRAGState) -> CRAGState:
if state["retrieval_grade"] == "CORRECT":
return {"final_context": state["retrieved_docs"]}
elif state["retrieval_grade"] == "INCORRECT":
return {"final_context": state["web_results"]}
return {"final_context": state["retrieved_docs"] + state["web_results"]}
builder = StateGraph(CRAGState)
builder.add_node("retrieve", lambda s: {"retrieved_docs": [d.page_content for d in retriever.invoke(s["query"])]})
builder.add_node("grade_docs", grade_retrieval)
builder.add_node("web_search", web_search_node)
builder.add_node("assemble", assemble_context)
builder.add_node("generate", lambda s: {"answer": llm.invoke(
f"Context: {s['final_context']}\nQuestion: {s['query']}"
).content})
builder.set_entry_point("retrieve")
builder.add_edge("retrieve", "grade_docs")
builder.add_conditional_edges("grade_docs",
lambda s: "web_search" if s["retrieval_grade"] in ("INCORRECT", "AMBIGUOUS") else "assemble",
{"web_search": "web_search", "assemble": "assemble"}
)
builder.add_edge("web_search", "assemble")
builder.add_edge("assemble", "generate")
builder.add_edge("generate", END)
crag = builder.compile()
Multimodal RAG
Concept
Multimodal RAG retrieves across image, text, table, and audio modalities. The key challenges are: 1. Embedding heterogeneous content — images and text need to share the same embedding space 2. Cross-modal queries — a text query should retrieve relevant images and vice versa
Approaches:
| Approach | How It Works | When to Use |
|---|---|---|
| Caption-based | Extract text captions from images, embed captions | Fast, low cost; loses visual detail |
| CLIP embeddings | Shared image+text embedding space | Good for photo/diagram retrieval |
| ColPali | Late interaction: each image patch gets a vector | Best precision for document images |
| GPT-4V / Gemini summaries | LLM describes image content, embed description | High quality, higher cost |
ColPali (2024) is state-of-the-art for document image retrieval — it embeds entire page images as a bag of patch vectors and uses MaxSim (late interaction) to score (query, page) pairs without OCR.
Code (Caption-based)
import base64
def image_to_base64(path: str) -> str:
with open(path, "rb") as f:
return base64.b64encode(f.read()).decode()
def extract_image_caption(image_path: str) -> str:
b64 = image_to_base64(image_path)
return llm.invoke([
{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
{"type": "text", "text": "Describe this image for retrieval. Include key entities, numbers, and concepts."}
]).content
# Build multimodal index
captions = {path: extract_image_caption(path) for path in image_paths}
vectorstore = Chroma.from_texts(
texts=list(captions.values()),
metadatas=[{"source": k, "type": "image"} for k in captions],
embedding=embeddings
)
Comparison Matrix
| Type | Retrieval Complexity | Build Cost | Query Cost | Latency | Best For |
|---|---|---|---|---|---|
| Naive RAG | Single ANN search | Low | Very Low | ~100ms | Prototypes, simple Q&A |
| Advanced RAG | Hybrid + rerank | Low | Low–Medium | ~300–600ms | Production Q&A systems |
| Modular RAG | Multiple retrievers | Medium | Medium | ~400–800ms | Diverse query types |
| Agentic RAG | Dynamic multi-step | Low | High | ~1–5s | Multi-hop, complex reasoning |
| GraphRAG | Graph traversal | Very High | Very High | ~2–10s | Relationship queries, synthesis |
| Multimodal RAG | Cross-modal ANN | High | Medium | ~500ms–2s | Image+text corpora |
Extended Variant Taxonomy
Beyond the six types above, production systems exhibit specialized patterns. These are named variants — most are specializations of Agentic or Advanced RAG.
| Variant | What It Does | Distinguishing Feature | Best For |
|---|---|---|---|
| Corrective RAG (CRAG) | Meta-evaluates retrieved chunks; falls back to web search if below threshold | Self-correcting retrieval loop | High-stakes domains where bad retrieval is worse than "I don't know" |
| Speculative RAG | Pre-fetches potential follow-up chunks before the user asks | Proactive, predictive retrieval | Customer support, FAQ flows with predictable patterns |
| Self-RAG | Feeds generated responses back through retrieval iteratively | Iterative self-refinement cycle | Complex multi-part answers |
| MEMO / Memory-Augmented RAG | Uses long-term conversational memory as retrieval context | Persistent episodic memory | Multi-session agents, personalized assistants |
| REALM-style RAG | Retriever jointly pre-trained with LLM via masked LM | End-to-end retriever training | When retriever quality is the dominant bottleneck |
| Cost-Constrained RAG | Dynamically skips reranking, reduces k, or uses cheaper embeddings under budget | Explicit cost-quality trade-off | High-volume products with tight per-query cost budgets |
| Adaptive RAG | Adjusts retrieval strategy (k, chunk size, model) based on query complexity | Strategy adapts per-query | Long conversations where early turns are simple, later turns complex |
| Streaming / Real-Time RAG | Continuously ingests new data; index always fresh | Sub-second index freshness | News feeds, live financial data, incident response |
When to choose each: - CRAG over Agentic RAG: when your retrieval quality is unreliable and a wrong context causes more harm than a fallback - Streaming RAG over standard RAG: when a retrieval lag of even 5 minutes is unacceptable - MEMO-style RAG: when users have many sessions and prior context should influence retrieval
Q&A Review Bank
Q: What's the difference between Advanced RAG and Modular RAG? [Medium]
A: Advanced RAG improves a fixed pipeline at pre/retrieval/post stages but keeps the same structure. Modular RAG deconstructs the pipeline entirely into swappable components and adds a routing layer that selects different retrieval modules (web, SQL, graph, vector DB) based on query classification. Modular RAG is the generalization that subsumes Advanced RAG.
Q: How does Self-RAG differ from standard RAG? [Medium]
A: Standard RAG always retrieves before generating — one retrieval, then one generation. Self-RAG uses a fine-tuned model that generates special reflection tokens ([Retrieve], [IsRel], [IsSup]) to decide dynamically: should I retrieve at all? Is this retrieved document relevant? Does this generated claim have support from the retrieved context? Self-RAG retrieves only when needed, validates each retrieval, and checks factual support mid-generation — resulting in fewer hallucinations and unnecessary retrievals. The downside is requiring a specially fine-tuned model.
Q: What is the core architectural difference between pipeline RAG and Agentic RAG? [Medium]
A: Pipeline RAG has a fixed control flow — retrieve once, generate once. Agentic RAG gives an LLM agent the ability to dynamically decide when to retrieve, what query to use, and whether to retrieve again based on intermediate results. Pipeline RAG has predictable latency; Agentic RAG has variable latency but handles multi-hop and ambiguous queries naturally.
Q: When would GraphRAG outperform dense vector retrieval? [Hard]
A: Three scenarios where graph traversal wins: (1) Multi-entity relationship queries — "Find all companies that both person X and person Y have worked at" requires graph traversal, not vector similarity. (2) Global thematic synthesis — "Summarize the main trends across 10,000 earnings reports" — GraphRAG community summaries aggregate information across the entire corpus. (3) Causal chain queries — "What led to Event A?" — knowledge graph edges explicitly represent causal relationships that are only implicit in flat text.
Q: What is FLARE and how does it decide when to retrieve? [Hard]
A: FLARE monitors token-level generation probabilities during forward pass. When a generated token has probability below a threshold (e.g., 0.2), the model is uncertain about that fact. FLARE pauses generation, uses the uncertain span as a retrieval query, fetches relevant documents, then continues generation with the new context. This enables mid-generation retrieval rather than front-loading all retrieval before generation starts. Best for long-form generation where information needs emerge as the model generates.
Q: What is Corrective RAG and why is it important for reliability? [Medium]
A: CRAG adds a retrieval quality gate: after retrieving documents, an LLM evaluates whether they actually address the query. If not (grade=INCORRECT), it falls back to web search rather than forcing the generator to work with irrelevant context. This matters because standard RAG silently fails — irrelevant retrieval leads to hallucination without any signal. CRAG makes retrieval failure explicit and provides a recovery path. The trade-off: adds latency (one extra LLM grading call + possible web search).
Q: Your multi-query RAG is adding too much latency. How do you optimize it? [Medium]
A: Three optimizations: (1) Parallelize retrievals — run all N variant queries concurrently using asyncio.gather() instead of sequentially; N variants takes the same wall-clock time as 1. (2) Cache embeddings — if the same query or variant reappears, the embedding is already computed. (3) Reduce N — 2–3 variants often gives 80% of the benefit of 5; empirically test on your eval set. (4) Diversify, don't duplicate — generate variants using MMR on query embeddings to ensure they cover different semantic regions.
Q: How does Contextual Compression help with the "lost in the middle" problem? [Medium]
A: The "lost in the middle" effect means LLMs perform worse when relevant information is buried in the middle of a long context window. Contextual compression reduces each chunk to only its relevant sentences before assembly — fewer total tokens means the relevant information is proportionally more prominent, less total context to "get lost in," and relevant spans are more likely to land near the beginning or end of the assembled context.
Q: Explain the trade-off between Agentic RAG and Pipeline RAG for a production system. [Hard]
A: Pipeline RAG gives predictable sub-2 second latency, is easy to test and monitor — appropriate for 80% of queries that are straightforward factual lookups. Agentic RAG handles the remaining 20% that require multi-step reasoning but adds variable latency (2–15 seconds) and is harder to debug deterministically. The production pattern is often hybrid: fast pipeline RAG as the primary path, Agentic RAG triggered only when the pipeline's confidence score is low or the query classifier detects a multi-hop pattern.
Q: How does ColPali differ from CLIP for image retrieval? [Hard]
A: CLIP produces one embedding per image (a global vector), losing spatial and layout information. ColPali uses a Vision Language Model to produce per-patch embeddings (typically 1024 patches per page image), then scores (query, image) pairs using MaxSim — the max similarity over all patch-query pairs. This preserves layout information critical for document page retrieval (charts, tables, diagrams) and achieves significantly higher precision on document QA benchmarks without requiring OCR.
Q: What would make you choose Naive RAG over Advanced RAG for a production system? [Easy]
A: Latency constraints (Naive RAG is 3–5× faster), cost constraints (no reranking API calls), or when the corpus is small, clean, and queries are simple factual lookups where the extra precision of reranking provides negligible benefit. Naive RAG is also easier to debug and monitor. Start with Naive RAG, measure quality gaps, then add Advanced RAG components only where evaluation shows measurable improvement.