Back to Field Notes
FIELD NOTE

RAG failure patterns I see repeatedly

3 min read

Context

RAG systems fail in predictable ways. Most teams focus on retrieval accuracy and ignore failure modes that only surface at scale or under adversarial queries.

Failure pattern 1: Citation drift

Symptom

The LLM cites a document that exists, but the cited page/section does not actually support the answer.

Why it happens

  • Retrieval returns a relevant document, but the LLM grounds on the wrong section.
  • Long documents increase chance of spurious grounding.
  • No validation that citation IDs appear in the final answer.

Mitigation

  • Post-generation citation validation: Extract cited IDs and verify they appear in the answer.
  • Chunking with boundaries: Preserve section headers; avoid splitting logical blocks.
  • Grounding prompts: Force the model to quote before citing.

Failure pattern 2: Permission-aware retrieval gaps

Symptom

Users with restricted access get empty results or “I don’t know,” even when relevant documents exist in other tenants.

Why it happens

  • Embedding index is global; tenant filters applied after retrieval.
  • Reranker downweights restricted content.
  • No fallback to explain permission boundaries.

Mitigation

  • Tenant-scoped embeddings: Isolate vectors per tenant.
  • Access-aware rerank: Boost allowed content before hybrid scoring.
  • Graceful fallback: Respond with “You don’t have access to X; request access” instead of silence.

Failure pattern 3: Temporal staleness blindness

Symptom

Answers reference outdated policies or versions, confusing users.

Why it happens

  • No version metadata in chunks.
  • Retrieval scores ignore recency.
  • LLM defaults to most recent context without knowing recency.

Mitigation

  • Time-decayed scores: Penalize older chunks in reranking.
  • Version tags in metadata: Filter by effective date.
  • Explicit recency prompt: Tell the model the current date and document horizon.

Failure pattern 4: Multi-hop collapse

Symptom

Questions requiring reasoning across two documents return answers from only one.

Why it happens

  • Single-hop retrieval: only fetch top-k for one query.
  • No query decomposition.
  • LLM hallucinates cross-document links.

Mitigation

  • Query decomposition: Break multi-hop into sub-queries.
  • Iterative retrieval: Retrieve, then ask “what else do you need?”
  • Graph-based grounding: Build small document graphs for complex queries.

Trade-offs I accept

  • Latency vs. validation: Post-generation citation checks add 100–200ms but are non-negotiable for auditability.
  • Storage vs. tenant isolation: Multi-tenant embeddings increase storage ~2x but prevent data leaks.
  • Complexity vs. multi-hop: Iterative retrieval adds roundtrips but increases answer completeness.

Takeaway

RAG reliability is a systems problem. You need validation, tenant-awareness, recency, and multi-hop strategies - not just better vectors.