Most RAG hallucination guides spend half their length on "pick the right model" and the other half on prompt-engineering tricks. In our production RAG deployments — healthcare medical-inquiry agents at 98.4% citation accuracy, legal research agents at 97% citation accuracy with zero fabricated citations across four months of production, telco support bots at 110k+ conversations per week — neither the model choice nor the prompt was the dominant lever. The dominant levers are structural, boring, and unsexy. Here they are, ranked by how much they actually drop hallucination rate when we measure on real traffic.
If you are debugging a RAG system that hallucinates too often, work this list top-down. Most teams find the first three items close 70-80% of the gap before they touch the model.
Lever 1 — Hybrid retrieval (biggest single win)
Vector-only retrieval misses keyword-bound queries — error codes, drug names, SKU strings, case citations, regulation numbers, version strings. The embedding model never learned that "0x80070005" is a signal; it learned the surrounding semantic context, which is approximately useless. The fix is BM25 lexical search in parallel with dense embedding search, fused by a re-ranker (Cohere Rerank, Voyage rerank, or a small cross-encoder model).
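One way to picture the "in parallel" part is merging the two ranked lists into a single candidate pool before the re-ranker sees them. The sketch below uses reciprocal rank fusion (RRF), a simple rank-merging method swapped in here purely for illustration. It is not necessarily what our deployments use, and the document IDs, the query, and the constant `k=60` are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so a document ranked well by either retriever rises towards the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for the query "error 0x80070005 on install".
bm25_hits = ["kb-417", "kb-102", "kb-933"]   # lexical search: exact code match first
dense_hits = ["kb-102", "kb-588", "kb-417"]  # dense search: semantically similar pages

candidates = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Documents found by both retrievers outrank documents found by only one, and the merged pool is what gets handed to the re-ranker.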
Lever 2 — Chunking discipline (second-biggest win)
Chunking is where context dies. A naive 512-token sliding window splits a contract clause in half, separates a drug-dosage table from its column headers, orphans a footnote from the paragraph it modifies. Retrieval then returns a fragment that means almost nothing on its own — and the model fills the gap with its prior. That gap-filling is the most common source of hallucination we see in audits.
The fix is structural chunking — break on heading, table, list, or section boundary rather than arbitrary token count. Add document-level metadata (title, source, section, version) to every chunk so the retrieval layer can filter and the model has provenance to cite. Use overlapping windows for narrative content; use strict-boundary chunks for tabular or reference content. Most RAG accuracy regressions we audit trace back to chunking before the model.
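As a rough illustration of boundary-based chunking, here is a minimal sketch that splits a Markdown-style document at heading lines and stamps document-level metadata on every chunk. It assumes headings are marked with `#`. The `Chunk` fields, the function name, and the sample document are hypothetical, and a real pipeline would also need to handle tables, lists, and overlap for narrative text.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    title: str    # document-level metadata, repeated on every chunk
    source: str
    section: str  # heading the chunk was found under
    version: str


HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(doc_text, title, source, version):
    """Split at heading lines so a section is never cut in half."""
    chunks, section, buffer = [], "preamble", []

    def flush():
        body = "\n".join(buffer).strip()
        if body:
            chunks.append(Chunk(body, title, source, section, version))

    for line in doc_text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section before starting a new one
            section = match.group(2).strip()
            buffer.clear()
        else:
            buffer.append(line)
    flush()
    return chunks


doc = """# Dosage
| Age | Dose  |
|-----|-------|
| 12+ | 10 mg |

# Storage
Keep below 25 C.
"""
chunks = chunk_by_heading(doc, title="Drug X label", source="label.md", version="2024-01")
```

The dosage table stays in one chunk with its column headers, and each chunk carries the section name the model can cite.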
Lever 3 — Required citations or no answer ships
A large share of RAG hallucinations comes from the model answering on its own when retrieval returned nothing relevant. The structural fix: require a citation on every answer, and reject any generated answer that does not link a specific claim to a specific retrieved chunk. If the model cannot ground its response, the system returns "I do not have enough information to answer" and routes to a human or a follow-up question.
This is enforced at generation time: the prompt format requires citation markers, and an output validator rejects any answer that lacks them. Users learn to trust this faster than they trust any "AI confidence score" the model emits, because the citation gives them a verifiable anchor instead of a number to guess at. See [our RAG development page](/rag-development-services) for the full citation-required architecture.
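A minimal sketch of such an output validator follows. It assumes citations appear as numeric markers like `[2]` placed before the sentence-ending punctuation, and that every sentence counts as a claim. The marker format, function name, and sentence splitter are illustrative assumptions rather than a description of any particular production validator.

```python
import re

REFUSAL = "I do not have enough information to answer"
CITATION = re.compile(r"\[(\d+)\]")


def validate_answer(answer, retrieved_ids):
    """Return the answer only if every sentence cites a retrieved chunk."""
    allowed = set(retrieved_ids)
    # Naive sentence split: punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if not sentences:
        return REFUSAL
    for sentence in sentences:
        cited = {int(n) for n in CITATION.findall(sentence)}
        # Reject uncited claims and citations to chunks that were never retrieved.
        if not cited or not cited <= allowed:
            return REFUSAL
    return answer
```

For example, `validate_answer("The dose is 10 mg [1]. It also cures colds.", {1, 2})` returns the refusal string, because the second sentence has no citation marker.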
Lever 4 — Refusal as a first-class action
Refusal is not the absence of an answer; it is an explicit, valued output. The model is configured to emit "I do not have enough information" as a structured response when no retrieval result clears a confidence threshold, when the query is out of the indexed scope, or when the query is in a category that requires human judgement (clinical triage, legal advice, financial guarantees).
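The three refusal triggers above can be sketched as a small decision function that returns a structured result instead of free text. The category names, the in-scope set, and the `0.5` threshold are hypothetical placeholders; real thresholds would come from the eval set.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RefusalReason(Enum):
    LOW_CONFIDENCE = "low_confidence"
    OUT_OF_SCOPE = "out_of_scope"
    NEEDS_HUMAN = "needs_human"


@dataclass
class Decision:
    answer_allowed: bool
    reason: Optional[RefusalReason] = None
    message: str = ""


REFUSAL_TEXT = "I do not have enough information"
HUMAN_ONLY = {"clinical_triage", "legal_advice", "financial_guarantee"}
IN_SCOPE = {"dosage", "storage", "interactions"}  # hypothetical indexed categories


def decide(category, top_score, threshold=0.5):
    """Decide whether to answer, given the query category and best retrieval score."""
    if category in HUMAN_ONLY:
        return Decision(False, RefusalReason.NEEDS_HUMAN, REFUSAL_TEXT)
    if category not in IN_SCOPE:
        return Decision(False, RefusalReason.OUT_OF_SCOPE, REFUSAL_TEXT)
    if top_score is None or top_score < threshold:
        return Decision(False, RefusalReason.LOW_CONFIDENCE, REFUSAL_TEXT)
    return Decision(True)
```

Because the reason is a typed field, downstream code can route `NEEDS_HUMAN` to a person and `LOW_CONFIDENCE` to a follow-up question, and the eval harness can score refusal correctness per reason.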
Lever 5 — Re-ranking before stuffing context
Teams default to top-10 or top-20 retrieval and stuff every chunk into the prompt. This is worse for hallucination than top-3 after re-ranking, because the model gets more irrelevant content to weigh and is more likely to anchor on a marginally relevant chunk that does not actually contain the answer. Re-ranking is cheap (single-digit milliseconds with a hosted cross-encoder), and a smaller, sharper context is both faster and more accurate.
- Retrieve top-20 candidates with hybrid retrieval
- Re-rank to top-3 with a cross-encoder model trained on relevance
- Stuff only the top-3 into the prompt — with full chunk text and metadata
- Track retrieval recall and citation accuracy per category in the eval harness
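The retrieve, re-rank, and stuff steps above can be sketched as follows. The scorer is pluggable: `token_overlap` is a toy stand-in used only so the example runs without a model, and in practice a cross-encoder or hosted re-ranker would take its place. The chunk fields, the query, and the context format are assumptions.

```python
def rerank(query, candidates, score_fn, keep=3):
    """Score every candidate against the query and keep the best few."""
    scored = [(score_fn(query, c["text"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:keep]]


def token_overlap(query, text):
    """Toy relevance score: fraction of query tokens that appear in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def build_context(chunks):
    """Format the kept chunks with numbered markers and metadata for the prompt."""
    return "\n\n".join(
        f"[{i}] ({c['source']}, {c['section']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )


# Hypothetical candidates; a real pipeline would pass in the top-20 from retrieval.
candidates = [
    {"id": "a", "source": "kb", "section": "Router",
     "text": "How to reset the router password from the admin page"},
    {"id": "b", "source": "kb", "section": "Hardware",
     "text": "Router LED colours explained"},
    {"id": "c", "source": "kb", "section": "Billing",
     "text": "Billing cycle and invoices"},
    {"id": "d", "source": "kb", "section": "Voicemail",
     "text": "Reset your voicemail password"},
]

top = rerank("reset router password", candidates, token_overlap, keep=3)
context = build_context(top)
```

The numbered markers in the context give the model something concrete to cite, which is what a citation validator can later check against.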
Lever 6 — Eval harness gating every prompt change
Every team building RAG "will set up evals later." Later never comes. The team that ships a working RAG system at week six is the team that wrote the golden eval set at week one. The eval set is 200-500 real queries from real users (or a careful simulation), each with a correct answer, the source it should cite, the correct refusal flag, and the expected tool call if the agent has tools.
Every prompt change, every model swap, every chunking tweak runs against the eval set. Citation accuracy, retrieval recall, refusal correctness, and hallucination rate are reported per intent category. Anything that regresses past threshold blocks the deploy — see the [LLM eval harness post](/blog/ai-evals-from-scratch) for the underlying pattern. Without CI-gated evals, every change is a roll of the dice; with them, every change has measured impact.
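A minimal sketch of the gating logic, assuming each golden query has already been run and graded into a boolean record. The `EvalResult` fields, the threshold values, and the category names are hypothetical; the point is only the shape of computing metrics per category and then failing the build on any breach.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EvalResult:
    category: str             # intent category of the golden query
    retrieved_expected: bool  # expected source appeared in retrieval
    cited_correct: bool       # answer cited the expected source
    refusal_correct: bool     # refused exactly when the golden set says to
    hallucinated: bool        # answer contained an unsupported claim


# Hypothetical thresholds; set them from your own baseline run.
MIN_THRESHOLDS = {"retrieval_recall": 0.90, "citation_accuracy": 0.95,
                  "refusal_correctness": 0.95}
MAX_THRESHOLDS = {"hallucination_rate": 0.02}


def per_category_metrics(results):
    buckets = defaultdict(list)
    for r in results:
        buckets[r.category].append(r)
    metrics = {}
    for category, rows in buckets.items():
        n = len(rows)
        metrics[category] = {
            "retrieval_recall": sum(r.retrieved_expected for r in rows) / n,
            "citation_accuracy": sum(r.cited_correct for r in rows) / n,
            "refusal_correctness": sum(r.refusal_correct for r in rows) / n,
            "hallucination_rate": sum(r.hallucinated for r in rows) / n,
        }
    return metrics


def gate(metrics):
    """Return a list of threshold breaches; an empty list means the deploy may proceed."""
    failures = []
    for category in sorted(metrics):
        m = metrics[category]
        for name, floor in MIN_THRESHOLDS.items():
            if m[name] < floor:
                failures.append(f"{category}: {name} {m[name]:.2f} < {floor:.2f}")
        for name, ceiling in MAX_THRESHOLDS.items():
            if m[name] > ceiling:
                failures.append(f"{category}: {name} {m[name]:.2f} > {ceiling:.2f}")
    return failures
```

In a CI job, the script would print the failures and exit non-zero when `gate(...)` returns anything, which is what blocks the deploy.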
Lever 7 — Model choice (smaller than you think)
Model choice matters far less than the levers above. In our experience, the gap between Claude Sonnet, GPT-4o, and a well-tuned Llama 3 70B on a properly engineered RAG system is 1-3 percentage points on citation accuracy and effectively zero on hallucination rate once citations are required. Where model choice does matter: structured-output reliability (top-tier models are noticeably better at clean JSON), instruction following on long contexts, and refusal-on-cue (top-tier models honour refusal instructions more reliably).
Default to the cheapest, fastest model that clears the eval bar. Upgrade only when the eval set proves the gap. We pick models per task on our [LLM development engagements](/llm-development-company) — not per vendor loyalty. Model selection is the last 10% of the gain after the structural levers, not the first 50%.
What does not move the needle
Three things teams routinely spend weeks on that we have not seen move hallucination rate in production:
- Prompt-engineering tricks ("think step-by-step", "you are an expert physician", chains of thought). Marginal at best; sometimes regressive on shorter-context tasks.
- Larger context windows. Stuffing more chunks rarely helps once you have hybrid retrieval and re-ranking; it just costs more tokens and slows generation.
- Switching embedding models without changing chunking. Embedding model upgrades produce 1-3% recall lifts; chunking discipline produces 15-20% lifts. Spend the engineering hours on the bigger lever.
Putting it together
If your RAG hallucination rate is too high, work the list top-down. Add hybrid retrieval first; redo chunking second; require citations and reject ungrounded answers third; design refusal as a first-class action fourth; add re-ranking fifth; gate every change on an eval harness sixth; tune model choice last. Most teams find the first three close 70-80% of the gap before they touch the model — which is the opposite of where teams instinctively look. The structural levers are unsexy and load-bearing. The model-choice lever is sexy and marginal.
If you are stuck on a RAG deployment that hallucinates more than your users tolerate — and the structural levers are not yielding the lift you need — that is exactly the kind of engagement we ship. Bring the eval set and the architecture; [book a 30-minute discovery call](/contact-us) and we will tell you on the call whether the gap is closeable in six weeks, what it will cost, and which lever to start with.
Frequently asked questions
Why is my RAG system hallucinating even with a top-tier LLM?
Because hallucination is rarely about the LLM. The model is making up an answer because retrieval returned the wrong chunks, or returned no chunks but the prompt did not require a citation, or the retrieved chunk was a fragment that lost its surrounding context. Work the list above top-down: hybrid retrieval, chunking, required citations. The model upgrade is the last lever, not the first.
What is an acceptable hallucination rate for production RAG?
Domain-dependent. For consumer-facing knowledge-base bots, 2-5% is typical and acceptable. For regulated workloads (healthcare, legal, finance), the bar is closer to 0.5-1% on safety-critical categories — and refusal is preferred over a guessed answer. Our [healthcare RAG deployment](/case-studies/medical-inquiry-system) runs at 98.4% citation accuracy with a 6% refusal rate, which clinicians find acceptable.
Can I use re-ranking without changing my existing vector store?
Yes. Re-ranking sits between retrieval and the LLM. Keep your existing vector store (pgvector, Qdrant, Pinecone, Weaviate), retrieve top-20 with the existing query, then pass the candidates through Cohere Rerank, Voyage rerank, or a small cross-encoder model and stuff only the top-3 into the prompt. The overhead is single-digit milliseconds, and most corpora show a measurable accuracy lift.
How long does it take to drop hallucination rate on an existing RAG system?
Two to six weeks for a focused rebuild, depending on how many levers need work. A full six-week engagement runs as follows: week one is the eval set and audit; weeks two and three are hybrid retrieval and chunking rework; weeks four and five are citation enforcement, refusal tuning, and re-ranking; week six is production rollout with drift monitoring. Most engagements drop hallucination rate by 3-8x on the same eval set in that window without changing the LLM. [Book a discovery call](/contact-us) to scope the work for your system.
