Why Language Models Hallucinate at All
A language model that hallucinates is doing exactly what it was trained to do: predict the next plausible token. It has no concept of “true” or “false” — only of likely. When the model is asked something its training does not strongly cover, it interpolates. The interpolation sounds right. It is often wrong.
The fix is not a better model. The fix is to never let the model interpolate on facts that matter. Every factual claim in an answer must trace back to a retrieved passage from your ground truth source. If the passage isn't there, the model must refuse rather than guess. This is the entire discipline of building an AI knowledge base that doesn't hallucinate.
- A single canonical source of truth, version-controlled.
- Retrieval that returns citations, not just text.
- A prompt contract that forces the model to ground every claim.
- Refusal as a first-class behavior, not an embarrassment.
Ground Truth First: Your Knowledge Base Is a Product
The most common failure mode: teams point a RAG system at a sprawling Confluence, half a Notion, a Drive folder, and some Slack history. Retrieval then returns duplicates, contradictions, and stale pages, and the model picks whichever passage looks most relevant, which is often the wrong one.
Treat your knowledge base as a product with an owner, a change log, and a freshness SLA. Every document has a status: canonical, draft, or archived. Only canonical docs are indexed. When two canonical docs conflict, that is a bug; resolve it before shipping.
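The status rule can be enforced as a filter in front of the indexer. The following is a minimal sketch, not a prescribed implementation: the `Doc` shape, the `indexable` helper, and the sample document IDs are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    CANONICAL = "canonical"
    DRAFT = "draft"
    ARCHIVED = "archived"


@dataclass(frozen=True)
class Doc:
    doc_id: str      # hypothetical identifier scheme
    status: Status
    text: str


def indexable(docs):
    """Return only the documents that are allowed into the index."""
    return [d for d in docs if d.status is Status.CANONICAL]


# Hypothetical sample documents, one per status.
docs = [
    Doc("refund-policy", Status.CANONICAL, "Current refund policy text."),
    Doc("refund-policy-rewrite", Status.DRAFT, "Proposed refund policy text."),
    Doc("refund-policy-2019", Status.ARCHIVED, "Superseded refund policy text."),
]

to_index = indexable(docs)
print([d.doc_id for d in to_index])
```

Keeping the filter in code rather than in a convention means a draft cannot reach retrieval by accident.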
Chunking and Retrieval Patterns
Chunking is where most builds quietly fail. The default of “split every 500 tokens” mutilates ideas mid-thought and the model gets a passage that ends on a comma. We recommend semantic chunking: break on H2/H3 boundaries, paragraphs, or list items. Each chunk should be one coherent idea.
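A rough sketch of structure-aware chunking, assuming the source documents are Markdown with `##` and `###` headings and blank lines between paragraphs. The function name and the chunk dictionary shape are illustrative choices; a production chunker would also need to handle lists, tables, and oversized paragraphs.

```python
import re

# Matches a Markdown H2 or H3 line such as "## Refunds".
HEADING = re.compile(r"^#{2,3}\s+(.*)$")


def chunk_by_structure(markdown_text):
    """Split text into one chunk per paragraph, tagged with its nearest heading."""
    chunks = []
    heading = None
    for block in re.split(r"\n\s*\n", markdown_text.strip()):
        lines = block.strip().split("\n")
        match = HEADING.match(lines[0])
        if match:
            # A heading starts a new section; any text right under it is its first chunk.
            heading = match.group(1).strip()
            lines = lines[1:]
        text = "\n".join(lines).strip()
        if text:
            chunks.append({"heading": heading, "text": text})
    return chunks


sample = """## Refunds

Refunds go through the original payment method.

### Exceptions
Gift cards are handled separately.
"""

chunks = chunk_by_structure(sample)
for chunk in chunks:
    print(chunk["heading"], "->", chunk["text"])
```

Carrying the heading along with each chunk gives the retriever and the citation something meaningful to point at.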
Hybrid Retrieval
Pure semantic search misses exact-match queries (“what is the refund window for product SKU-4421?”). Pure keyword search misses paraphrases. Use both, then rerank the merged results with a small model. The combined recall is usually 30–40 points higher than either alone.
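One common way to merge a keyword result list with a semantic one is reciprocal rank fusion (RRF). Using RRF here is an assumption of this sketch, as are the document IDs and the constant `k = 60`. The reranking step is left out.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranked list.

    Each document scores 1 / (k + rank) for every list it appears in,
    so items that rank well in both lists rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword index and a vector index.
keyword_hits = ["sku-4421-refunds", "refund-overview", "shipping"]
semantic_hits = ["refund-overview", "returns-faq", "sku-4421-refunds"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

The fused list would then be passed to the reranker, which only has to reorder a short candidate set.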
Citations or It Didn't Happen
Retrieval should return passages with citations: document ID, section, and version. The model is then required to attach a citation to every claim it makes. If the user clicks the citation and it doesn't support the claim, that is a measurable, fixable bug.
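A sketch of what a citation-carrying passage might look like, plus a cheap check that every bracketed citation in an answer points at a passage that was actually retrieved. The `doc_id#section@version` format and all names are assumptions made for illustration. The check only catches invented citations; whether a cited passage supports the claim still needs human or model review.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    doc_id: str
    section: str
    version: str
    text: str

    @property
    def citation(self):
        # Hypothetical citation format: doc_id#section@version
        return f"{self.doc_id}#{self.section}@{self.version}"


def unknown_citations(answer, passages):
    """Return citations that appear in the answer but match no retrieved passage."""
    known = {p.citation for p in passages}
    cited = set(re.findall(r"\[([^\[\]]+)\]", answer))
    return sorted(cited - known)


retrieved = [Passage("refund-policy", "window", "v3", "Refund window text.")]

good = "The refund window is described in the policy [refund-policy#window@v3]."
bad = "Enterprise pricing is tiered [pricing#tiers@v1]."

print(unknown_citations(good, retrieved))
print(unknown_citations(bad, retrieved))
```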
The Prompt Contract
The system prompt is a legal contract between you and the model. Ours always includes four clauses:
- “Answer only using the retrieved passages below. Do not use prior knowledge.”
- “Every factual claim must include a citation in [square brackets].”
- “If the passages do not contain enough information, reply with the exact phrase: I don't have a confident answer for this. Please contact support.”
- “Never combine information from passages dated more than 90 days apart without flagging the date gap.”
The third clause, the refusal clause, is the most important. Refusal is the feature that distinguishes a trustworthy AI knowledge base from a confident liar. Train your team to celebrate refusals: they are the system protecting your customers.
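The four clauses can be kept in code so the refusal phrase in the prompt and the one your application checks for never drift apart. This is a sketch under assumptions: the function names and the prompt layout are illustrative, and passages are assumed to arrive as `(citation, text)` pairs.

```python
REFUSAL = "I don't have a confident answer for this. Please contact support."

CLAUSES = [
    "Answer only using the retrieved passages below. Do not use prior knowledge.",
    "Every factual claim must include a citation in [square brackets].",
    "If the passages do not contain enough information, reply with the exact "
    f"phrase: {REFUSAL}",
    "Never combine information from passages dated more than 90 days apart "
    "without flagging the date gap.",
]


def build_system_prompt(passages):
    """Assemble the contract and the retrieved passages into one system prompt.

    passages: list of (citation, text) tuples (an assumed shape).
    """
    rules = "\n".join(f"{i}. {clause}" for i, clause in enumerate(CLAUSES, start=1))
    context = "\n".join(f"[{citation}] {text}" for citation, text in passages)
    return f"{rules}\n\nRetrieved passages:\n{context}"


def is_refusal(reply):
    """True when the model replied with exactly the agreed refusal phrase."""
    return reply.strip() == REFUSAL


prompt = build_system_prompt([("refund-policy#window@v3", "Refund window text.")])
print(prompt)
```

Because the refusal is an exact phrase, `is_refusal` can drive logging, escalation, and the weekly review of unanswered questions without any fuzzy matching.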
Engineering for Uncertainty
Add a confidence score to every answer. The score is computed from: (a) cosine similarity of the top retrieved chunk, (b) how recent the chunk is, (c) whether the model self-reports certainty. Answers below a threshold are shown to the user with a visible “low confidence” banner and a one-tap escalation to a human.
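One way the three signals might be combined is a weighted sum. Everything numeric in this sketch is an assumption to be tuned against your own data: the weights, the linear recency decay over 180 days, and the 0.6 threshold.

```python
from datetime import date

LOW_CONFIDENCE_THRESHOLD = 0.6  # assumed cutoff, tune on real traffic


def confidence(similarity, reviewed_on, self_reported_certain, today,
               max_age_days=180, weights=(0.5, 0.3, 0.2)):
    """Blend similarity, recency, and self-reported certainty into a 0..1 score."""
    # (a) cosine similarity of the top chunk, clamped to 0..1
    sim = min(max(similarity, 0.0), 1.0)
    # (b) recency: 1.0 when just reviewed, falling linearly to 0.0 at max_age_days
    age_days = (today - reviewed_on).days
    recency = min(max(1.0 - age_days / max_age_days, 0.0), 1.0)
    # (c) whether the model said it was certain
    certain = 1.0 if self_reported_certain else 0.0
    w_sim, w_recency, w_certain = weights
    return w_sim * sim + w_recency * recency + w_certain * certain


def needs_low_confidence_banner(score):
    return score < LOW_CONFIDENCE_THRESHOLD


today = date(2024, 7, 1)
strong = confidence(0.9, date(2024, 7, 1), True, today)
weak = confidence(0.4, date(2023, 1, 1), False, today)
print(round(strong, 2), needs_low_confidence_banner(strong))
print(round(weak, 2), needs_low_confidence_banner(weak))
```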
Honesty about uncertainty is the single biggest user-experience advantage you have over generic chatbots. Customers forgive “I'm not sure, let me escalate.” They do not forgive confidently wrong answers.
Keeping the Knowledge Base Fresh
A stale knowledge base hallucinates by definition — the “facts” it returns are no longer true. Build a freshness pipeline:
- Each document carries a last reviewed date.
- Documents older than 90 days trigger a review reminder.
- Documents older than 180 days are auto-flagged as “stale” in retrieval.
- Customer questions where the AI refused get analyzed weekly — missing knowledge becomes a doc creation ticket.
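The date rules above can be sketched as a small classifier that runs on a schedule. The 90-day and 180-day limits come from the list; the function name and the label strings are illustrative.

```python
from datetime import date

REVIEW_AFTER_DAYS = 90   # older than this: send a review reminder
STALE_AFTER_DAYS = 180   # older than this: flag as stale in retrieval


def freshness(last_reviewed, today):
    """Classify a document by the age of its last review."""
    age_days = (today - last_reviewed).days
    if age_days > STALE_AFTER_DAYS:
        return "stale"
    if age_days > REVIEW_AFTER_DAYS:
        return "needs_review"
    return "fresh"


today = date(2024, 7, 1)
print(freshness(date(2024, 6, 1), today))   # reviewed 30 days ago
print(freshness(date(2024, 3, 1), today))   # reviewed 122 days ago
print(freshness(date(2023, 12, 1), today))  # reviewed 213 days ago
```

Storing the label alongside each chunk lets retrieval demote or annotate stale passages instead of silently serving them.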
Evaluating Honesty, Not Just Accuracy
Standard eval suites measure accuracy: did the model produce the right answer? For knowledge bases you also need honesty: did the model refuse when it should have? A 100-question test set should include 20 questions whose answers are not in your knowledge base. The right behavior is to refuse all 20.
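A sketch of how the honesty side of the eval might be scored, assuming each test case records whether the answer exists in the knowledge base and that refusals use a fixed phrase. The metric names and the tiny result set are hypothetical.

```python
REFUSAL = "I don't have a confident answer for this. Please contact support."


def honesty_report(results, refusal=REFUSAL):
    """Score refusal behavior.

    results: list of (answerable, reply) pairs, where answerable is True
    when the knowledge base contains the answer.
    """
    unanswerable = [reply for answerable, reply in results if not answerable]
    answerable = [reply for is_answerable, reply in results if is_answerable]
    correct_refusals = sum(reply.strip() == refusal for reply in unanswerable)
    false_refusals = sum(reply.strip() == refusal for reply in answerable)
    return {
        # Share of unanswerable questions the model correctly refused.
        "refusal_recall": correct_refusals / len(unanswerable) if unanswerable else None,
        # Share of answerable questions the model refused anyway.
        "false_refusal_rate": false_refusals / len(answerable) if answerable else None,
    }


# Hypothetical mini test set: two unanswerable and two answerable questions.
results = [
    (False, REFUSAL),
    (False, "The warranty lasts five years."),  # a guess; should have refused
    (True, "See the refund policy [refund-policy#window@v3]."),
    (True, REFUSAL),                            # refused although the answer exists
]

report = honesty_report(results)
print(report)
```

Tracking the false refusal rate next to refusal recall guards against a system that looks honest only because it refuses everything.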
For more on the broader stack, see our RAG architecture guide and how we build production knowledge systems.
FAQ
How big does our knowledge base need to be? Quality over volume. 200 high-quality canonical docs beat 5,000 mixed-quality ones every time.
Can we use the public web as a knowledge source? Only with domain restrictions and freshness checks. The open web is a hallucination accelerant.
What model do you recommend? Any frontier model can be honest with the right prompt contract. We default to Claude for its strong refusal behavior, but the architecture matters far more than the model.