Most RAG prototypes look great on the demo deck and fall apart in week three of production. The vector store query returns the wrong chunks. The model invents a citation that does not match the source document. A user asks something out of scope and the system fabricates an answer rather than refusing. The team patches the prompt, ships the patch, regresses something else. Three months in, nobody trusts the bot.
We have shipped retrieval-augmented generation systems for healthcare, finance, telco, staffing, and EdTech. Across those deployments, the failure modes are the same, the fixes are the same, and almost none of them are about the LLM. Here is the short version of what nobody tells you when you start building RAG.
1. Naive vector-only retrieval misses keyword-bound queries
Dense embeddings are great at semantic similarity and terrible at exact-match retrieval. Your support agent asks about "error code 0x80070005" and the vector store returns documents about general permission errors — because the embedding model never learned that the error code is the signal. Same for drug names, SKU codes, legal citation formats, regulation numbers, and product version strings. If your domain has any of those, vector-only retrieval is going to underperform.
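The paragraph above does not name a fix. One common remedy is hybrid retrieval: run a keyword index alongside the vector store and fuse the two ranked lists. The sketch below uses reciprocal rank fusion and assumes both rankings already exist. The function name, the document ids, and the constant k=60 are illustrative choices, not taken from any particular deployment.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties break on document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for the query "error code 0x80070005".
keyword_hits = ["kb-0x80070005", "kb-install-log"]  # exact-match index
vector_hits = ["kb-permissions", "kb-acl-basics", "kb-0x80070005"]  # dense index

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused[0])  # kb-0x80070005
```

The exact-match document ranks first because it is the only one present in both lists, even though the dense index placed it third.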
2. Chunking is more important than the embedding model
Teams obsess over which embedding model to use and ignore the question that actually moves the needle: how the source documents are broken into chunks. Chunking is where context dies. A naive 512-token sliding window splits a contract clause in half, separates a question from its answer, and orphans a footnote from the table it explains. Retrieval then returns a fragment that means almost nothing on its own.
- Chunk by structural boundary (heading, table, list, section) not arbitrary token count
- Add document-level metadata (title, source, section) to every chunk so retrieval can filter by it
- Use overlapping windows for narrative content; use strict-boundary chunks for tabular or reference content
- Re-evaluate chunking when accuracy regresses — most regressions trace back here before the model
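As a rough illustration of the first two bullets, the sketch below splits a document on heading lines and stamps every chunk with document-level metadata. It assumes headings start with "#", which is a hypothetical convention; real sources (PDF, HTML, DOCX) need their own boundary detection. The `Chunk` fields and the function name are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    title: str    # document-level metadata, repeated on every chunk
    source: str
    section: str  # heading the chunk sits under
    text: str


def chunk_by_heading(doc_text, title, source):
    """Split on lines starting with '#', keeping each section whole."""
    chunks = []
    section = "(preamble)"
    buffer = []

    def flush():
        # Emit the buffered lines as one chunk, skipping empty sections.
        body = "\n".join(buffer).strip()
        if body:
            chunks.append(Chunk(title, source, section, body))

    for line in doc_text.splitlines():
        if line.startswith("#"):
            flush()
            buffer.clear()
            section = line.lstrip("#").strip()
        else:
            buffer.append(line)
    flush()
    return chunks


doc = (
    "# Termination\n"
    "Either party may terminate with 30 days notice.\n"
    "\n"
    "# Fees\n"
    "Fees are due net 45.\n"
)
chunks = chunk_by_heading(doc, title="Master Services Agreement", source="msa.md")
for c in chunks:
    print(c.section, "->", c.text)
```

Each clause stays intact, and the section name travels with the chunk so retrieval can filter on it.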
3. Require citations, or no answer ships
Hallucinations in RAG almost always come from the model answering on its own when retrieval returned nothing relevant. The fix is structural, not prompty: require a citation on every answer, and reject any answer without one. If the model cannot ground its response in a retrieved chunk, the system says "I do not have enough information to answer" and routes to a human or a follow-up question. Users learn to trust this faster than they trust the "AI confidence score" everyone wishes they could just print.
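A minimal sketch of that gate, assuming the model is prompted to cite chunk ids in square brackets such as [policy-3]. The bracket convention, the regex, and the function name are assumptions for illustration; the refusal text is the one quoted above.

```python
import re

REFUSAL = "I do not have enough information to answer"
CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")  # assumed citation format


def enforce_citations(answer, retrieved_ids):
    """Pass the answer through only if it cites at least one chunk and
    every cited id was actually retrieved; otherwise refuse."""
    cited = set(CITATION.findall(answer))
    if not cited or not cited <= set(retrieved_ids):
        return REFUSAL
    return answer


retrieved = ["policy-3", "policy-7"]
grounded = enforce_citations("Refunds take 5 days [policy-3].", retrieved)
uncited = enforce_citations("Refunds take 5 days.", retrieved)
invented = enforce_citations("Refunds take 5 days [policy-9].", retrieved)
```

The check runs in application code after generation, so a prompt change cannot silently disable it. It only verifies that a citation exists and points at a retrieved chunk, not that the chunk supports the claim.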
4. The eval harness is the contract — write it first
Every team building RAG "will set up evals later." Later never comes. The team that ships a working RAG system at week six is the team that wrote the golden eval set at week one. The eval set is 200-500 real queries from your real users (or your best simulation of them), each with a correct answer, the source it should cite, and a refusal flag for out-of-scope queries.
Every prompt change, every model swap, every chunking tweak runs against the eval set. The harness reports retrieval recall, citation accuracy, refusal correctness, and end-to-end answer quality. Anything that regresses past threshold is blocked from shipping. This is the single highest-leverage piece of RAG infrastructure and the most common one to skip.
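The harness can start very small. The sketch below computes three of the four metrics named above and returns the ones that fall below threshold. End-to-end answer quality is left out because it needs a grader. The class names, field names, sample cases, and threshold values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GoldenCase:
    query: str
    expected_source: Optional[str]  # None for out-of-scope queries
    should_refuse: bool


@dataclass
class RunResult:
    retrieved: List[str]   # chunk ids returned by retrieval
    cited: Optional[str]   # chunk id the answer cited, if any
    refused: bool


def evaluate(cases, results):
    """Score one run of the system against the golden set."""
    pairs = list(zip(cases, results))
    in_scope = [(c, r) for c, r in pairs if not c.should_refuse]
    n = max(len(in_scope), 1)  # avoid division by zero on an all-refusal set
    return {
        "retrieval_recall": sum(c.expected_source in r.retrieved for c, r in in_scope) / n,
        "citation_accuracy": sum(r.cited == c.expected_source for c, r in in_scope) / n,
        "refusal_correctness": sum(r.refused == c.should_refuse for c, r in pairs) / len(pairs),
    }


def gate(metrics, thresholds):
    """Return the metrics below their floor; an empty list means ship."""
    return [name for name, floor in thresholds.items() if metrics[name] < floor]


cases = [
    GoldenCase("How long do refunds take?", "policy-3", False),
    GoldenCase("What is the SLA for priority tickets?", "sla-1", False),
    GoldenCase("Who will win the election?", None, True),
]
results = [
    RunResult(["policy-3", "policy-7"], "policy-3", False),
    RunResult(["sla-2"], "sla-2", False),  # retrieval missed the right source
    RunResult([], None, True),
]

metrics = evaluate(cases, results)
blocked = gate(metrics, {"retrieval_recall": 0.9,
                         "citation_accuracy": 0.9,
                         "refusal_correctness": 0.95})
print(metrics, blocked)
```

Wiring `gate` into CI so a non-empty result fails the build is what turns the eval set into a contract rather than a dashboard.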
5. Cost and latency live or die at retrieval time
The dominant cost in production RAG is usually LLM input tokens, which means it is driven by how many retrieved chunks you stuff into the prompt. Teams default to top-10 or top-20 retrieval and can pay around 5× the cost of a properly re-ranked top-3 setup that scores higher on the eval set. Shorter prompts are cheaper, faster, and frequently more accurate because the model has less irrelevant content to weigh.
- Default to top-3 after re-ranking, not top-10 before re-ranking
- Use prompt caching (Anthropic, OpenAI) for the system prompt and tool definitions, which can cut the cost of those repeated prefix tokens by 60-90% at scale
- Cache document embeddings server-side; re-embed a document only when its content changes, never at query time
- Set per-step latency budgets (retrieval, rerank, LLM) and alert on regressions
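To make the top-k arithmetic concrete, here is a back-of-envelope sketch. Every number in it is an illustrative assumption: 500 tokens per chunk, 500 tokens of fixed prompt overhead, one million requests a month, and a placeholder price of 3.00 USD per million input tokens. It ignores output tokens and the cost of the re-ranker itself.

```python
def prompt_tokens(top_k, chunk_tokens=500, fixed_tokens=500):
    """Input tokens per request: fixed prompt overhead plus retrieved chunks."""
    return fixed_tokens + top_k * chunk_tokens


def monthly_input_cost(top_k, requests_per_month, usd_per_million_tokens):
    """Monthly spend on input tokens for a given top-k setting."""
    total_tokens = prompt_tokens(top_k) * requests_per_month
    return total_tokens * usd_per_million_tokens / 1_000_000


wide = monthly_input_cost(top_k=20, requests_per_month=1_000_000,
                          usd_per_million_tokens=3.0)
narrow = monthly_input_cost(top_k=3, requests_per_month=1_000_000,
                            usd_per_million_tokens=3.0)
ratio = wide / narrow

print(wide, narrow, ratio)  # 31500.0 6000.0 5.25
```

Under these assumptions top-20 costs a little over five times as much as top-3. The ratio moves with chunk size and fixed overhead, so plug in your own numbers before quoting it.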
6. Permissions are part of retrieval, not a layer above it
In multi-tenant or permission-aware deployments, which covers most enterprise RAG, the temptation is to retrieve from a shared vector store and filter results after the fact based on the calling user's permissions. This is wrong on two levels. First, the similarity search and its top-k slots are spent on documents the user will never see, so filtering afterwards can leave too few usable results. Second, and more importantly, post-filtering leaks information through retrieval timing and through partial results that are visible in logs and analytics.
The right pattern is permission-aware retrieval: the user's permission scope is part of the query, and the vector store filters at the index level before similarity. In Postgres + pgvector, this is a WHERE clause on a tenant_id or role column. In Qdrant it is a payload filter, and in Weaviate a property filter, combined with the vector search.
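The ordering is the whole point, so here is a minimal in-memory sketch of it: restrict to the caller's tenant first, then rank only what remains. The row layout, the `search` function, and the tenant names are hypothetical. In a real store the same filter would be pushed into the query itself rather than applied in application code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index, query_vec, tenant_id, top_k=3):
    """Filter by tenant first, then rank only the permitted rows."""
    permitted = [row for row in index if row["tenant_id"] == tenant_id]
    ranked = sorted(permitted,
                    key=lambda row: cosine(row["embedding"], query_vec),
                    reverse=True)
    return [row["id"] for row in ranked[:top_k]]


index = [
    {"id": "a-1", "tenant_id": "acme", "embedding": [1.0, 0.0]},
    {"id": "a-2", "tenant_id": "acme", "embedding": [0.6, 0.8]},
    {"id": "g-1", "tenant_id": "globex", "embedding": [1.0, 0.0]},
]

hits = search(index, [1.0, 0.0], tenant_id="acme")
print(hits)  # ['a-1', 'a-2']
```

The globex row is an exact match for the query vector, yet it is never scored for an acme caller, so it cannot surface in results, logs, or timing.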
Wrapping up
RAG is mostly an engineering discipline problem, not a model problem. Pick the simplest retrieval architecture that clears your eval bar, require citations, run a refusal layer, and gate every change against your golden set. The teams that get this right ship in 6-8 weeks. The teams that obsess over which LLM provider has the best benchmark numbers ship a brittle demo and rebuild it in month four.
