What RAG Actually Is
RAG (retrieval-augmented generation) is the pattern where the model answers questions using passages retrieved from your data rather than relying on what it memorized in training. Done right, it gives factual, grounded, citation-backed answers. Done wrong, it gives you a chatbot that hallucinates with extra steps.
Data Preparation
Most RAG failures start before any vector is computed. Source data usually has duplicates, conflicting versions, outdated pages, and PDFs full of garbage. Clean and canonicalize first. No amount of fancy retrieval can fix bad data.
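As a starting point, exact and near-exact duplicates can be collapsed with a normalized content hash. The sketch below is a minimal illustration, not a full cleaning pipeline: the `Doc` fields and the "keep the most recently updated copy" rule are assumptions, and conflicting versions with different wording still need a canonical-source rule of your own.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str    # hypothetical identifier, e.g. a URL or file path
    text: str
    updated: str   # ISO date string such as "2024-05-01"


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedupe(docs: list[Doc]) -> list[Doc]:
    """Keep one copy per normalized text, preferring the most recently updated."""
    newest: dict[str, Doc] = {}
    for doc in docs:
        key = hashlib.sha256(normalize(doc.text).encode("utf-8")).hexdigest()
        kept = newest.get(key)
        # ISO dates compare correctly as plain strings.
        if kept is None or doc.updated > kept.updated:
            newest[key] = doc
    return list(newest.values())


docs = [
    Doc("wiki/returns", "Returns are accepted within 30 days.", "2023-01-10"),
    Doc("pdf/returns-copy", "Returns  are accepted\nwithin 30 days.", "2024-03-02"),
    Doc("wiki/shipping", "Orders ship in two business days.", "2024-01-15"),
]
cleaned = dedupe(docs)  # the two "returns" variants collapse into one
```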
Chunking Strategy
Fixed-size token chunking, the usual default, cuts text mid-idea and destroys context. Use semantic chunking: break on headings (H2/H3), paragraphs, or list items. Each chunk should contain one coherent idea. Add metadata (document title, section, last-updated) to every chunk.
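A rough sketch of heading-and-paragraph chunking, assuming the source has already been converted to Markdown with `##` and `###` headings. The `Chunk` fields mirror the metadata listed above; the names are illustrative.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    title: str          # document title
    section: str        # nearest H2/H3 heading above the chunk
    last_updated: str


HEADING = re.compile(r"^(#{2,3})\s+(.*)$")


def chunk_markdown(source: str, title: str, last_updated: str) -> list[Chunk]:
    """Split on H2/H3 headings, then on blank lines: one paragraph per chunk."""
    chunks: list[Chunk] = []
    section = ""
    buffer: list[str] = []

    def flush() -> None:
        text = "\n".join(buffer).strip()
        if text:
            chunks.append(Chunk(text, title, section, last_updated))
        buffer.clear()

    for line in source.splitlines():
        match = HEADING.match(line)
        if match:
            flush()                      # close the chunk under the old heading
            section = match.group(2).strip()
        elif not line.strip():
            flush()                      # blank line ends a paragraph
        else:
            buffer.append(line)
    flush()
    return chunks


page = """## Returns
Items can be returned within 30 days.

Refunds go to the original payment method.

### Exceptions
Gift cards are final sale.
"""
chunks = chunk_markdown(page, title="Store policy", last_updated="2024-03-02")
```

Here each paragraph becomes its own chunk and carries the heading it appeared under, so a retrieved chunk can be cited by document and section.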
Embedding Choice
Modern embedding models are commoditized. OpenAI, Cohere, and open-source options all perform similarly for general text. Pick one, stick with it, and evaluate recall on your specific corpus.
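Evaluating recall on your own corpus needs only a small labeled set: for each test query, the IDs of the chunks that should come back. The sketch below computes mean recall@k from that. The query strings and chunk IDs are made up for illustration, and `runs` stands in for whatever your retriever returns.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant chunks that appear in the top k results."""
    if not relevant:
        raise ValueError("each query needs at least one labeled relevant chunk")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(
    runs: dict[str, list[str]], labels: dict[str, set[str]], k: int
) -> float:
    """Average recall@k over every labeled query."""
    return sum(recall_at_k(runs[q], labels[q], k) for q in labels) / len(labels)


# Hypothetical labels: query -> chunk IDs a good retriever should return.
labels = {
    "return window": {"c1"},
    "gift card refunds": {"c3", "c7"},
}
# Hypothetical retriever output: query -> ranked chunk IDs.
runs = {
    "return window": ["c1", "c4", "c9"],
    "gift card refunds": ["c2", "c3", "c5"],
}
score = mean_recall_at_k(runs, labels, k=3)
```

Running the same labeled set against two embedding models gives a like-for-like comparison on your data rather than on a public benchmark.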
Retrieval: Hybrid Search Always
Pure semantic search misses exact matches (SKUs, product names). Pure keyword search misses paraphrases. Combine: get top-K from each, merge, re-rank. Recall typically jumps 30–40 points.
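One common way to do the merge step is reciprocal rank fusion (RRF), which combines ranked lists without needing their scores to be comparable. This is a sketch of that technique, not the only option. It assumes you already have two ranked lists of chunk IDs, one from the vector index and one from a keyword index such as BM25; the constant 60 is a widely used default, not a tuned value.

```python
from collections import defaultdict


def rrf_merge(rankings: list[list[str]], k: int = 60, top_n: int = 20) -> list[str]:
    """Fuse several ranked lists of chunk IDs with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Chunks ranked highly in any list get a larger contribution.
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))[:top_n]


semantic_hits = ["c2", "c5", "c1"]   # hypothetical vector-search results
keyword_hits = ["c9", "c2", "c7"]    # hypothetical keyword-search results
merged = rrf_merge([semantic_hits, keyword_hits])
```

A chunk found by both retrievers (`c2` here) rises to the top, while exact-match hits that only the keyword side found still make the candidate list passed to the re-ranker.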
- Cleaned, canonicalized source data.
- Semantic chunking with rich metadata.
- Hybrid retrieval (semantic + keyword).
- Re-ranker on top-20.
- Citation-enforced generation.
- Eval suite running weekly.
Re-ranking
A small re-ranker over the top 20 retrieved chunks produces a meaningfully better top 5. Cross-encoders (like Cohere's rerank or open-source alternatives) are cheap and effective.
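The re-ranking step itself is small: score each candidate against the query, sort, and keep the best few. The sketch below keeps the scorer pluggable. `overlap_score` is a toy stand-in so the example runs without a model; in practice you would pass a function that calls a cross-encoder.

```python
from typing import Callable


def rerank(
    query: str,
    chunks: list[str],
    score: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Order candidate chunks by relevance score and keep the best top_n."""
    ranked = sorted(chunks, key=lambda chunk: score(query, chunk), reverse=True)
    return ranked[:top_n]


def overlap_score(query: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: share of query words found in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0


candidates = [
    "Shipping takes two days.",
    "Gift cards are final sale.",
    "Returns are accepted within 30 days.",
]
best = rerank("are gift cards refundable", candidates, overlap_score, top_n=2)
```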
Generation With Citations
The system prompt must require a citation for every claim. The model is forbidden from inventing facts not in the retrieved chunks. If the retrieved chunks don't answer the question, the model must say so.
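A prompt alone is not enforcement, so it helps to check the output too. The sketch below numbers the retrieved chunks in the prompt and then verifies that every sentence of the answer cites a source number that exists. The prompt wording, the refusal string, and the `[n]` citation format are all assumptions for illustration. The check validates citation format only; it does not confirm that the cited chunk actually supports the claim.

```python
import re

# Hypothetical refusal text the model is told to use when sources fall short.
REFUSAL = "I can't answer that from the available sources."


def build_prompt(question: str, chunks: list[str]) -> str:
    """Number each retrieved chunk and instruct the model to cite by number."""
    sources = "\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return (
        "Answer using only the sources below. "
        "Cite a source number in square brackets, like [1], inside every sentence. "
        f"If the sources do not answer the question, reply exactly: {REFUSAL}\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


def citations_valid(answer: str, num_sources: int) -> bool:
    """True if the answer is the refusal, or every sentence cites a real source."""
    answer = answer.strip()
    if answer == REFUSAL:
        return True
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer) if s]
    for sentence in sentences:
        cited = [int(n) for n in re.findall(r"\[(\d+)\]", sentence)]
        if not cited or any(n < 1 or n > num_sources for n in cited):
            return False
    return bool(sentences)  # an empty answer is not valid


prompt = build_prompt(
    "Can I return a gift card?",
    ["Returns are accepted within 30 days.", "Gift cards are final sale."],
)
```

An answer that fails the check can be retried or replaced with the refusal text rather than shown to the user.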
Top 7 Mistakes
- Throwing all your data in without cleaning.
- Default chunking sized for the model, not for the ideas.
- Pure semantic search with no keyword fallback.
- No re-ranking layer.
- Generation prompt that doesn't enforce citations.
- No eval suite to catch regressions.
- Stale data — no freshness pipeline.
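The last mistake is the easiest to detect if every chunk carries a last-updated date. A minimal freshness check, assuming a mapping from chunk ID to that date and a 180-day threshold chosen purely for illustration:

```python
from datetime import date, timedelta


def stale_chunks(
    last_updated: dict[str, date], today: date, max_age_days: int = 180
) -> list[str]:
    """Return IDs of chunks whose source has not been updated within max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(cid for cid, updated in last_updated.items() if updated < cutoff)


# Hypothetical index metadata: chunk ID -> last-updated date of its source page.
index_dates = {
    "c1": date(2023, 1, 10),
    "c2": date(2024, 3, 2),
    "c3": date(2024, 5, 20),
}
to_refresh = stale_chunks(index_dates, today=date(2024, 6, 1))
```

Flagged chunks can be re-crawled, re-embedded, or dropped, depending on whether the source still exists.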
A well-built RAG system is boring. It answers questions accurately, cites sources, refuses when uncertain, and stays up-to-date. The fancy version most builders try first is the opposite of boring — and the opposite of working.
See AI knowledge bases without hallucination.
FAQ
Vector DB choice? pgvector if you're on Postgres. Pinecone or Weaviate at scale.
How big a corpus before RAG is worth it? Above ~50 documents. Below that, just include them in context.
Cost? Embedding is cheap. Storage is cheap. The expensive part is the generation call — same as without RAG.