RAG Evaluation and Failure Modes
Common Failure Modes
Concept
RAG failures fall into three categories: retrieval failures, generation failures, and pipeline failures. Understanding the taxonomy is critical for both building and debugging RAG systems.
RAG Failure Taxonomy
├── Retrieval Failures
│   ├── Wrong chunks retrieved (relevance failure)
│   ├── Relevant chunks not retrieved (recall failure)
│   ├── Correct chunks retrieved but wrong order (ranking failure)
│   └── No chunks retrieved (index/query failure)
├── Generation Failures
│   ├── Hallucination despite context (faithfulness failure)
│   ├── Ignoring retrieved context (context utilization failure)
│   ├── Verbatim copying without synthesis (over-reliance)
│   └── Lost in the middle (position bias)
└── Pipeline Failures
    ├── Chunk boundary bisects key information (chunking failure)
    ├── Index is stale (freshness failure)
    ├── Query-document vocabulary mismatch (lexical gap)
    └── Context window overflow (too many chunks stuffed)
Retrieval failures:
- Relevance failure: the ANN search returns chunks that are semantically adjacent but not actually relevant. Example: a query about "Python snake" retrieves programming docs instead of zoology docs in a mixed corpus.
- Recall failure: the relevant chunk exists in the index but isn't in the top-k. Fixes: increase k, add hybrid search, improve the embedding model, improve chunking.
- Ranking failure: the right chunks are retrieved but ranked below irrelevant ones. Fix: add reranking.
Generation failures:
- Faithfulness failure: the LLM generates claims not supported by the retrieved context. The most insidious failure: the answer looks correct but contains invented details.
- Context utilization failure: the LLM has the right context but ignores it, preferring its parametric memory. Common with GPT-3.5 and older models on highly contested facts.
- Lost in the middle: research (Liu et al., 2023) shows performance degrades significantly when relevant content is placed in the middle of a long context. The first and last positions are attended to best.
8-Dimension RAG Evaluation Framework
Concept
RAGAS covers the core four metrics, but production RAG evaluation requires a broader lens. This 8-dimension framework maps every failure mode to its measurement tool.
| Dimension | What It Measures | Tools | Key Metrics |
|---|---|---|---|
| Retrieval Quality | Are the right chunks being found? Covers precision, recall, and ranking order of retrieved context | RAGAS (context_precision, context_recall), custom Recall@k test sets | Context Precision, Context Recall, NDCG@k, MRR |
| Groundedness / Faithfulness | Are generated claims supported by retrieved context? Detects fabricated or unsupported assertions | RAGAS (faithfulness), TruLens, LangSmith, FactCC | Faithfulness score (0–1), Claim Support Rate |
| Hallucination Detection | Identifies model-generated content that contradicts the source or knowledge base | SelfCheckGPT (multi-sample consistency), GPTScore, NLI-based classifiers | Hallucination Rate, Contradiction Score |
| Answer Quality | Measures relevance, completeness, and clarity of the final answer | RAGAS (answer_relevancy), G-Eval / LLM-as-Judge, BERTScore, ROUGE-L | Answer Relevancy, ROUGE-L, BERTScore F1, BLEU |
| Robustness | How well does the system handle adversarial inputs, ambiguous queries, and out-of-distribution questions? | Adversarial test suites, perturbation testing | Accuracy under query paraphrase, accuracy under context noise |
| End-to-End Task Metrics | Task-specific correctness — F1, Exact Match, pass@k for code, accuracy for QA benchmarks | Dataset-specific (SQuAD F1, HotpotQA EM, HumanEval pass@k) | Exact Match, F1, pass@1, NDCG for ranking tasks |
| Latency & Cost | P50/P95 TTFT, tokens per query, embedding cost, rerank cost | Langfuse, LangSmith, Arize Phoenix, Cloud Trace | P95 latency, cost per query, cache hit rate |
| User Feedback Loop | Real-world quality signal — thumbs up/down, escalations, follow-up clarifications | Thumbs up/down UI, support ticket analysis, explicit feedback collection | Negative feedback rate, clarification rate, resolution rate |
How to prioritize dimensions in practice:
| Application Type | Primary Metric | Secondary Metric |
|---|---|---|
| Legal / Medical / Financial Q&A | Faithfulness (hallucination is catastrophic) | Retrieval Recall |
| Customer Support Chatbot | Answer Quality + User Feedback | Latency |
| Code Assistant | End-to-End Task (pass@k) | Robustness |
| Internal Knowledge Search | Retrieval Quality | Latency & Cost |
| Real-time Streaming RAG | Latency & Cost | Groundedness |
Recommended evaluation stack per dimension:
- Retrieval: RAGAS + a hand-labeled golden test set (200–500 (q, expected_chunk) pairs)
- Faithfulness: RAGAS faithfulness + FactCC for claim-level verification
- Hallucination: SelfCheckGPT (sample 5 responses, check internal consistency)
- Answer quality: LLM-as-Judge (G-Eval) + human spot-checks for 5% of traffic
- Latency: Langfuse or Arize Phoenix for distributed tracing with per-stage spans
- User feedback: thumbs up/down in UI → store to evaluation dataset → weekly RAGAS re-evaluation
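The golden-set retrieval check above can be sketched in a few lines. This assumes a LangChain-style retriever (`.invoke(query)` returning documents) whose documents carry a stable chunk id in metadata; the key name `chunk_id` is illustrative:

```python
# Sketch: Recall@k over a hand-labeled golden test set of
# (question, expected_chunk_id) pairs. Assumes each retrieved doc exposes
# .metadata with a (hypothetical) "chunk_id" key.

def recall_at_k(golden_set, retriever, k=5):
    if not golden_set:
        return 0.0
    hits = 0
    for question, expected_chunk_id in golden_set:
        docs = retriever.invoke(question)[:k]
        retrieved_ids = {d.metadata.get("chunk_id") for d in docs}
        if expected_chunk_id in retrieved_ids:
            hits += 1
    return hits / len(golden_set)
```

Run this on every index or chunking change; a drop in Recall@k localizes the regression to retrieval before any generation metric moves.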
RAGAS Metrics
Concept
RAGAS (Retrieval Augmented Generation Assessment) is the standard evaluation framework. It measures four dimensions, each using LLM-based scoring:
1. Faithfulness — Does the generated answer only contain claims that are supported by the retrieved context?
Implementation: extract all factual claims from the answer using an LLM, then for each claim, ask a judge LLM if the context supports it. Average across claims.
Score = 1.0: every claim in the answer is directly traceable to the context.
Score = 0.0: the answer is entirely fabricated or contradicts the context.
2. Answer Relevancy — How relevant is the generated answer to the user's question?
Implementation: prompt an LLM to generate N questions that the answer would answer, embed those generated questions, compute cosine similarity with the original question. If the answer is relevant to the question, the reverse-generated questions should be similar to the original.
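A minimal sketch of this reverse-question approach, assuming a hypothetical `llm` with a LangChain-style `.invoke(prompt).content` interface and an `embed` function that returns a vector:

```python
# Sketch of RAGAS-style answer relevancy: reverse-generate questions from the
# answer, then compare them to the original question by cosine similarity.
import numpy as np

def answer_relevancy(question, answer, llm, embed, n=3):
    # Ask the LLM to reverse-generate n questions the answer would answer.
    prompt = (f"Generate {n} questions, one per line, that the following "
              f"answer would answer:\n\n{answer}")
    generated = [q for q in llm.invoke(prompt).content.split("\n") if q.strip()][:n]
    q_vec = np.array(embed(question))
    sims = []
    for gq in generated:
        g_vec = np.array(embed(gq))
        sims.append(float(np.dot(q_vec, g_vec) /
                          (np.linalg.norm(q_vec) * np.linalg.norm(g_vec))))
    # Mean similarity: high when the answer clearly addresses the question.
    return sum(sims) / len(sims) if sims else 0.0
```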
3. Context Precision — Of the retrieved chunks, what fraction are actually relevant to the question?
Measures retrieval precision. A high context precision means the retriever isn't adding noise.
4. Context Recall — Do the retrieved chunks cover all the information needed to answer the question?
Context Recall = (sentences in ground truth answer attributable to retrieved context) / (total sentences in ground truth)
Requires a ground truth answer. Measures whether retrieval "missed" any necessary information.
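A rough manual version of this formula, under the simplifying assumptions that sentences split on periods and that a judge `llm` (LangChain-style `.invoke(prompt).content` interface) decides attributability:

```python
# Sketch of context recall: fraction of ground-truth sentences that a judge
# LLM considers attributable to the retrieved contexts.

def check_context_recall(ground_truth: str, contexts: list[str], llm) -> float:
    sentences = [s.strip() for s in ground_truth.split(".") if s.strip()]
    if not sentences:
        return 0.0
    joined = "\n".join(contexts)
    attributable = 0
    for sent in sentences:
        prompt = (f"Context:\n{joined}\n\nSentence: {sent}\n\n"
                  "Can this sentence be attributed to the context? "
                  "Answer YES or NO.")
        if llm.invoke(prompt).content.strip().upper().startswith("YES"):
            attributable += 1
    return attributable / len(sentences)
```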
Which metric matters most?
- Faithfulness: most important for factual/legal/medical applications where hallucination has real consequences
- Context Recall: most important during development to diagnose retrieval gaps
- Answer Relevancy: important for conversational applications where the answer should stay on-topic
- Context Precision: important for cost optimization (fewer chunks = lower token cost)
Code
```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Build evaluation dataset
eval_data = {
    "question": ["What is the refund policy?", "How long is the warranty?"],
    "answer": ["Customers can get a full refund within 30 days.", "The warranty lasts 2 years."],
    "contexts": [
        ["Refunds are available within 30 days of purchase for all plans..."],  # retrieved chunks
        ["All products come with a 2-year limited warranty..."],
    ],
    "ground_truth": [
        "Customers are eligible for a full refund within 30 days.",  # optional: for recall
        "The product warranty covers 2 years from purchase date.",
    ],
}

dataset = Dataset.from_dict(eval_data)

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
# Example output: {'faithfulness': 0.95, 'answer_relevancy': 0.88, 'context_precision': 0.9, 'context_recall': 0.85}
```
```python
# Manual faithfulness check (useful when RAGAS is overkill)
def check_faithfulness(answer: str, context: str, llm) -> float:
    prompt = f"""Extract all factual claims from the answer. For each claim,
determine if it is SUPPORTED or NOT_SUPPORTED by the context.

Context: {context}
Answer: {answer}

For each claim, write: CLAIM: <claim> | VERDICT: <SUPPORTED/NOT_SUPPORTED>"""
    response = llm.invoke(prompt).content
    lines = [l for l in response.split("\n") if "VERDICT:" in l]
    if not lines:
        return 0.0
    # "NOT_SUPPORTED" contains "SUPPORTED" as a substring, so exclude it explicitly
    supported = sum(1 for l in lines if "SUPPORTED" in l and "NOT_SUPPORTED" not in l)
    return supported / len(lines)
```
LLM-as-Judge
Concept
Beyond RAGAS's automated metrics, LLM-as-Judge uses an LLM to evaluate response quality on dimensions that are hard to formalize mathematically: clarity, completeness, tone, citation quality.
G-Eval framework: Define evaluation criteria as natural language rubrics, then ask an LLM to score (1-5) on each criterion. More reliable when using chain-of-thought before scoring.
Key considerations for LLM judges: - Use a stronger model as judge than the model being evaluated (use GPT-4o to judge GPT-3.5) - Self-evaluation is biased — a model tends to prefer its own outputs - Position bias: LLMs tend to favor the first response in pairwise comparison - Verbosity bias: LLMs tend to prefer longer, more detailed answers
When to use LLM-as-Judge vs RAGAS: - RAGAS: automated CI/CD gates on well-defined metrics (faithfulness, recall) - LLM-as-Judge: qualitative human-preference-aligned evaluation, A/B testing new retrieval strategies
Code
```python
import re

def g_eval_rag(question: str, context: str, answer: str, judge_llm) -> dict:
    prompt = f"""You are evaluating a RAG system's response. Score each criterion 1-5.

Question: {question}
Retrieved Context: {context}
Generated Answer: {answer}

Evaluate:
1. Faithfulness (1=completely fabricated, 5=every claim supported by context)
2. Relevance (1=completely off-topic, 5=directly and completely answers the question)
3. Completeness (1=major information missing, 5=all aspects addressed)
4. Clarity (1=confusing/misleading, 5=clear and precise)

First explain your reasoning for each, then provide scores in this format:
Faithfulness: X/5
Relevance: X/5
Completeness: X/5
Clarity: X/5"""
    response = judge_llm.invoke(prompt).content
    scores = {}
    for metric in ["Faithfulness", "Relevance", "Completeness", "Clarity"]:
        match = re.search(rf"{metric}: (\d)/5", response)
        if match:
            scores[metric.lower()] = int(match.group(1)) / 5.0
    return scores
```
Tracing and Observability
Concept
You cannot improve what you cannot measure. Production RAG systems need end-to-end tracing to diagnose latency, quality, and cost issues.
What to instrument:
Per request, trace:
├── query_id: unique ID for correlation
├── query: original user query
├── preprocessed_query: after rewriting/expansion
├── retrieval_latency: time to retrieve
├── retrieved_docs: {id, score, content_preview}[]
├── rerank_latency: if reranking used
├── reranked_docs: {id, score}[]
├── llm_latency: time to first token
├── total_tokens: prompt + completion tokens (cost tracking)
├── answer: final generated text
└── user_feedback: thumbs up/down if available
Tools:
| Tool | Type | Features |
|---|---|---|
| LangSmith | LangChain native | Full trace, LLM call inspection, eval datasets |
| Arize Phoenix | Open source | Local + cloud, RAGAS integration, drift detection |
| Vertex AI Tracing | GCP native | Integration with Cloud Trace, spans per service |
| W&B Weave | Weights & Biases | Experiment tracking + LLM tracing |
| Langfuse | Open source | EU-hosted option, RAGAS integration |
Key metrics to monitor in production:
- retrieval_latency_p95: alert if > 100ms
- answer_quality_score: sampled RAGAS faithfulness, alert if drops > 5% from baseline
- cache_hit_rate: alert if drops below 20% (indicates query distribution shift)
- token_cost_per_query: budget tracking
- user_negative_feedback_rate: thumbs down / complaint rate
Code
```python
# LangSmith tracing (minimal setup)
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "rag-production"
# LANGCHAIN_API_KEY must already be set in the environment.
# All LangChain calls are now automatically traced; no code changes needed.
```

```python
# Custom structured logging for non-LangChain components
import time

import structlog

log = structlog.get_logger()

def traced_retrieve(query: str, retriever) -> tuple[list, float]:
    start = time.perf_counter()
    docs = retriever.invoke(query)
    latency = time.perf_counter() - start
    log.info(
        "retrieval_complete",
        query=query[:100],
        num_docs=len(docs),
        latency_ms=round(latency * 1000, 2),
        top_doc_preview=docs[0].page_content[:100] if docs else None,
        doc_sources=[d.metadata.get("source") for d in docs],
    )
    return docs, latency
```

```python
# Arize Phoenix (open source observability)
import phoenix as px
from phoenix.trace.langchain import LangChainInstrumentor

px.launch_app()  # starts a local Phoenix server
LangChainInstrumentor().instrument()  # auto-instruments all LangChain calls
```
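The production alert thresholds listed above can be expressed as a simple rule table. The metric names here are illustrative; a real deployment would feed these values from Langfuse or Phoenix metric exports:

```python
# Sketch: evaluate the monitoring thresholds from this section against a
# snapshot of current metrics. Metric names are hypothetical.

ALERT_RULES = {
    "retrieval_latency_p95_ms": lambda v: v > 100,    # alert if > 100ms
    "faithfulness_drop_pct": lambda v: v > 5.0,       # alert if > 5% below baseline
    "cache_hit_rate": lambda v: v < 0.20,             # alert if below 20%
}

def check_alerts(metrics: dict) -> list[str]:
    """Return the names of all breached alert rules."""
    return [name for name, breached in ALERT_RULES.items()
            if name in metrics and breached(metrics[name])]
```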
A/B Testing RAG
Concept
To improve a RAG system, you need to know when a change is actually better. A/B testing RAG requires:
Offline evaluation (pre-deployment): Maintain a golden dataset of 100-500 (question, expected_answer, expected_source) tuples from production traffic. Before deploying a change, run both old and new pipelines against the golden set, compare RAGAS scores. Deploy only if new ≥ old on primary metric (typically faithfulness or Recall@5).
Online evaluation (shadow mode): Run new pipeline in parallel with production pipeline but only log results (don't serve to users). Compare quality scores from the shadow pipeline vs production.
Metric selection: Choose one primary metric for go/no-go decisions. Faithfulness for factual applications. Answer relevancy for conversational. Recall@5 for retrieval-focused improvements.
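The offline go/no-go gate described above reduces to a one-line comparison over score dicts shaped like RAGAS output; the primary metric and regression tolerance are configurable assumptions:

```python
# Sketch: deploy only if the new pipeline's primary metric does not regress
# on the golden set. Score dicts mirror RAGAS output, e.g. {"faithfulness": 0.95}.

def should_deploy(old_scores: dict, new_scores: dict,
                  primary: str = "faithfulness", tolerance: float = 0.0) -> bool:
    return new_scores[primary] >= old_scores[primary] - tolerance
```

A small nonzero tolerance avoids blocking deploys on judge-LLM noise, at the cost of permitting slow metric drift across successive releases.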
When RAG Is NOT the Answer
Concept
RAG is not always the right solution. Over-applying it wastes cost and latency.
| Scenario | Why RAG Fails | Better Approach |
|---|---|---|
| Structured data queries | "Show me all orders over $1000" — semantic search is wrong tool | SQL / text-to-SQL |
| Real-time computation | "What is 15% of $847?" — LLM should compute, not retrieve | Calculator tool, code execution |
| Tiny static corpus | 10 FAQ answers that never change | Stuff all FAQs into system prompt (no retrieval needed) |
| Very low latency (< 200ms) | RAG adds 100-500ms minimum | Fine-tuning or few-shot in system prompt |
| All answers from one document | A report that is always the full context | Summarization chain, not RAG |
| Private numerical data | Customer's account balance — lookup, not retrieval | Direct database query |
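A minimal pre-RAG router in the spirit of the table above. The keyword heuristics and route names are purely illustrative; production systems typically use an LLM classifier or function-calling for this decision:

```python
# Sketch: route queries that RAG handles poorly to a better tool before
# any retrieval happens. Patterns and route names are hypothetical.
import re

def route_query(query: str) -> str:
    if re.search(r"\d+\s*%|\bcalculate\b|\bsum of\b", query, re.I):
        return "calculator"      # real-time computation, not retrieval
    if re.search(r"\ball orders\b|\bover \$\d+|\bgroup by\b", query, re.I):
        return "text_to_sql"     # structured data query
    return "rag"                 # default: retrieval-augmented answer
```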
Interview Q&A
Q: What is RAGAS faithfulness and how is it computed? [Medium]
A: Faithfulness measures what fraction of factual claims in the generated answer are supported by the retrieved context. Computation: (1) extract all factual claims from the answer using an LLM, (2) for each claim, prompt a judge LLM with both the claim and the context, asking if the context supports the claim, (3) faithfulness = supported claims / total claims. A score of 1.0 means the answer only contains claims directly traceable to retrieved context; 0.0 means the answer is entirely fabricated or contradicts the context.
Q: How do you debug a RAG system where faithfulness is low (0.4) but context recall is high (0.9)? [Hard]
A: High context recall means the retriever IS finding relevant documents (90% of needed information is present). Low faithfulness means the LLM is ignoring the retrieved context and generating unsupported claims. Diagnoses: (1) check if the LLM's system prompt instructs it strongly enough to use only the provided context — strengthen the grounding instruction. (2) Test if the model ignores context for specific topics it has strong parametric priors about — the model may "know" a wrong answer and prefer it. (3) Check position — is the relevant context placed in the middle of a long context? Try reordering so highest-scored chunks appear first. (4) Check if retrieved chunks contain contradictory information — the model may blend both versions.
Q: You have no labeled data for evaluation. How do you set up a RAG evaluation pipeline? [Hard]
A: Use synthetic ground truth generation: (1) take 200 random document chunks from your corpus, (2) for each chunk, use an LLM to generate 2-3 realistic questions a user might ask that the chunk can answer, (3) use the same LLM to generate the expected answer from the chunk alone — this becomes your "ground truth." You now have 400-600 (question, context, expected_answer) tuples. Run RAGAS on these. Caveat: synthetic test sets are biased toward content the LLM can answer from the chunks — test on real user queries as soon as you have production traffic.
Q: What is the "lost in the middle" problem and how do you mitigate it? [Medium]
A: Studies show that LLMs perform significantly worse when the relevant information is placed in the middle of a long context window, compared to beginning or end positions. With k=5 retrieved chunks, position 0 and position 4 are best attended; positions 1-3 are less reliably used. Mitigations: (1) Reranking + ordering — use a cross-encoder reranker, then place the highest-scored chunks at position 0 and 4 (top and bottom), lower-scored ones in the middle. (2) Contextual compression — reduce each chunk to only its relevant sentences, so total context is shorter and the problem is less severe. (3) Fewer, higher-quality chunks — k=3 with high-precision reranking often outperforms k=8 without it.
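The "sandwich" ordering from mitigation (1) can be sketched as a small helper that alternates the highest-scored chunks between the front and back of the prompt:

```python
# Sketch: reorder reranked chunks (best first) so the strongest chunks land
# at the start and end of the context, weaker ones in the middle.

def sandwich_order(chunks_by_score: list) -> list:
    front, back = [], []
    for i, chunk in enumerate(chunks_by_score):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

With five chunks ranked 1 (best) to 5, this yields the order [1, 3, 5, 4, 2]: the best chunk first, the second-best last, and the weakest in the middle.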
Q: What's the difference between context precision and context recall in RAGAS? [Easy]
A: Context precision measures the fraction of retrieved chunks that are relevant to the question — it's a retriever quality metric about noise. Context recall measures whether the retrieved chunks contain all the information needed to answer the question — it's a retriever quality metric about completeness. High context precision + low context recall: the retriever returns only relevant chunks but misses some necessary ones (try increasing k or improving hybrid search). Low context precision + high context recall: the retriever returns all necessary information but also a lot of noise (try better reranking or lower k).
Q: When should you NOT use RAG? [Easy]
A: RAG adds latency and cost unnecessarily when: (1) the entire knowledge base fits in a system prompt (10 FAQ entries — just include them all); (2) queries are computationally answerable (math, SQL queries over structured data — use tools, not retrieval); (3) latency requirements are very tight (<200ms) and the corpus is static and small enough to fine-tune; (4) every query needs the full document (legal contract review, financial report analysis — pass the full document, not retrieved snippets).