Most RAG prototypes look great on the demo deck and fall apart in week three of production. The vector store query returns the wrong chunks. The model invents a citation that does not match the source document. A user asks something out of scope and the system fabricates an answer rather than refusing. The team patches the prompt, ships the patch, regresses something else. Three months in, nobody trusts the bot.

We have shipped retrieval-augmented generation systems for healthcare, finance, telco, staffing, and EdTech. Across those deployments, the failure modes are the same, the fixes are the same, and almost none of them are about the LLM. Here is the short version of what nobody tells you when you start building RAG.

1. Naive vector-only retrieval misses keyword-bound queries

Dense embeddings are great at semantic similarity and terrible at exact-match retrieval. Your support agent asks about "error code 0x80070005" and the vector store returns documents about general permission errors — because the embedding model never learned that the error code is the signal. Same for drug names, SKU codes, legal citation formats, regulation numbers, and product version strings. If your domain has any of those, vector-only retrieval is going to underperform.
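One common remedy is hybrid retrieval: run a keyword search alongside the vector search and fuse the two rankings. Below is a minimal sketch using reciprocal rank fusion, assuming you already have two ranked lists of document IDs. The function name, the document IDs, and the constant k=60 (a conventional default) are illustrative assumptions, not details from any specific deployment.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so a document ranked highly by either retriever rises toward the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for the query "error code 0x80070005".
vector_hits = ["kb-permissions-overview", "kb-acl-basics", "kb-0x80070005"]
keyword_hits = ["kb-0x80070005", "kb-windows-update-errors"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

In this toy input the exact-match article sits last in the vector ranking but first in the keyword ranking, so it moves to the front of the fused list. Whether fusion is enough on its own, or needs a re-ranker behind it, is something the eval set should decide.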

2. Chunking is more important than the embedding model

Teams obsess over which embedding model to use and ignore the question that actually moves the needle: how the source documents are broken into chunks. Chunking is where context dies. A naive 512-token sliding window splits a contract clause in half, separates a question from its answer, and orphans a footnote from the table it explains. Retrieval then returns a fragment that means almost nothing on its own.

Chunk by structural boundary (heading, table, list, section), not by arbitrary token count
Add document-level metadata (title, source, section) to every chunk so retrieval can filter by it
Use overlapping windows for narrative content; use strict-boundary chunks for tabular or reference content
Re-evaluate chunking when accuracy regresses; most regressions trace back to chunking before they trace back to the model
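The first two points can be sketched in a few lines. This example assumes Markdown-style source documents where a heading line starts with "#"; the Chunk fields and the chunk_by_heading helper are hypothetical names chosen for illustration, and real documents (PDFs, HTML, contracts) would need a format-specific parser in place of the regex.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    title: str    # document-level metadata, repeated on every chunk
    source: str
    section: str  # heading the chunk falls under


def chunk_by_heading(doc_text, title, source):
    """Split a Markdown-style document at heading lines ("# ", "## ", ...)."""
    chunks = []
    section = "(preamble)"
    buffer = []

    def flush():
        # Emit the buffered lines as one chunk under the current section.
        body = "\n".join(buffer).strip()
        if body:
            chunks.append(Chunk(body, title, source, section))
        buffer.clear()

    for line in doc_text.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            flush()  # close the previous section before starting a new one
            section = match.group(1).strip()
        else:
            buffer.append(line)
    flush()
    return chunks


doc = """# Refund policy
Refunds are issued within 14 days.

## Exceptions
Gift cards are non-refundable.
"""
chunks = chunk_by_heading(doc, title="Customer Terms", source="terms.md")
```

Each chunk keeps its section heading and source, so a retrieved fragment such as "Gift cards are non-refundable." still carries the context that it is an exception to the refund policy.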

3. Required citations or no answer ships

Hallucinations in RAG almost always come from the model answering on its own when retrieval returned nothing relevant. The fix is structural, not prompty: require a citation on every answer, and reject any answer without one. If the model cannot ground its response in a retrieved chunk, the system says "I do not have enough information to answer" and routes to a human or a follow-up question. Users learn to trust this faster than they trust the "AI confidence score" everyone wishes they could just print.
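A minimal sketch of that gate, assuming the model is prompted to cite chunks inline as [chunk-id]. The citation format, the enforce_citations name, and the chunk IDs are assumptions made for illustration.

```python
import re

REFUSAL = "I do not have enough information to answer"


def enforce_citations(answer, retrieved_ids):
    """Return the answer only if every citation points at a retrieved chunk.

    An answer with no citations, or with a citation to a chunk that was
    not retrieved, is replaced by the refusal message.
    """
    cited = set(re.findall(r"\[([A-Za-z0-9_-]+)\]", answer))
    if not cited or not cited <= set(retrieved_ids):
        return REFUSAL
    return answer


retrieved = ["policy-12", "policy-40"]
ok = enforce_citations("Refunds take 14 days [policy-12].", retrieved)
invented = enforce_citations("Refunds take 3 days [policy-99].", retrieved)
uncited = enforce_citations("Refunds take 3 days.", retrieved)
```

This check only confirms that a citation exists and refers to a chunk that was actually retrieved. It does not verify that the cited chunk supports the claim; that is a job for the citation-accuracy metric in the eval set.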

4. The eval harness is the contract — write it first

Every team building RAG "will set up evals later." Later never comes. The team that ships a working RAG system at week six is the team that wrote the golden eval set at week one. The eval set is 200-500 real queries from your real users (or your best simulation of them), each with a correct answer, the source it should cite, and a refusal flag for out-of-scope queries.

Every prompt change, every model swap, every chunking tweak runs against the eval set. The harness reports retrieval recall, citation accuracy, refusal correctness, and end-to-end answer quality. Anything that regresses past threshold is blocked from shipping. This is the single highest-leverage piece of RAG infrastructure and the most common one to skip.
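As a rough sketch of what such a harness might look like, the example below scores three of the four metrics and returns the names of any that fall below their thresholds. EvalCase, Result, run_eval, and the toy system are hypothetical names; end-to-end answer quality is left out because it needs a human or model grader.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalCase:
    query: str
    expected_source: str  # chunk or document the answer should cite
    should_refuse: bool   # True for out-of-scope queries


@dataclass
class Result:
    retrieved_ids: list
    cited_id: Optional[str]
    refused: bool


def run_eval(cases, system, thresholds):
    """Run every case through `system` (a callable query -> Result)."""
    in_scope = [c for c in cases if not c.should_refuse]
    results = {c.query: system(c.query) for c in cases}

    recall = sum(c.expected_source in results[c.query].retrieved_ids for c in in_scope)
    cited = sum(results[c.query].cited_id == c.expected_source for c in in_scope)
    refusal = sum(results[c.query].refused == c.should_refuse for c in cases)

    metrics = {
        "retrieval_recall": recall / max(len(in_scope), 1),
        "citation_accuracy": cited / max(len(in_scope), 1),
        "refusal_correctness": refusal / max(len(cases), 1),
    }
    # Any metric below its floor blocks the change from shipping.
    failures = [name for name, floor in thresholds.items() if metrics[name] < floor]
    return metrics, failures


def toy_system(query):
    # Stand-in for the real pipeline: knows one topic, refuses everything else.
    if "refund" in query.lower():
        return Result(["policy-12", "policy-40"], "policy-12", refused=False)
    return Result([], None, refused=True)


cases = [
    EvalCase("How long do refunds take?", "policy-12", should_refuse=False),
    EvalCase("What is the capital of France?", "", should_refuse=True),
]
metrics, failures = run_eval(
    cases, toy_system, {"retrieval_recall": 0.9, "refusal_correctness": 1.0}
)
```

Wiring the failures list into CI, so that a non-empty list fails the build, is what turns the eval set from a report into a gate.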

5. Cost and latency live or die at retrieval time

The dominant cost in production RAG is usually LLM tokens, which means it is driven by how many retrieved chunks you stuff into the prompt. Teams default to top-10 or top-20 retrieval and pay 5× the cost of a properly re-ranked top-3 setup that scores higher on the eval set. Smaller contexts are cheaper, faster, and frequently more accurate because the model has less irrelevant content to weigh.
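The arithmetic is easy to check for your own setup. The sketch below assumes 400-token chunks and 500 tokens of fixed prompt (system prompt plus question); both numbers are placeholders, and output tokens are ignored.

```python
def prompt_tokens(top_k, chunk_tokens=400, fixed_tokens=500):
    """Input tokens per request: fixed prompt plus top_k retrieved chunks."""
    return fixed_tokens + top_k * chunk_tokens


baseline = prompt_tokens(3)
# Input-token multiple of each setting relative to top-3.
ratios = {k: prompt_tokens(k) / baseline for k in (3, 10, 20)}
```

Under these assumed sizes, a top-10 prompt carries about 2.6 times the input tokens of a top-3 prompt, and a top-20 prompt carries 5 times as many. The larger your chunks relative to the fixed prompt, the closer the multiple gets to the raw ratio of chunk counts.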

Default to top-3 after re-ranking, not top-10 before re-ranking
Use prompt caching (Anthropic, OpenAI) for the system prompt and tool definitions — 60-90% cost reduction at scale
Cache document embeddings server-side; re-embed a document only when its content changes, never at query time
Set per-step latency budgets (retrieval, rerank, LLM) and alert on regressions
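The embedding-cache point can be sketched with a content hash as the change detector. The EmbeddingCache class and the embed_fn parameter are hypothetical, and the in-memory dict stands in for whatever server-side store you actually use.

```python
import hashlib


class EmbeddingCache:
    """Re-embed a document only when its content hash changes."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # hypothetical callable: text -> list of floats
        self._store = {}          # doc_id -> (content_hash, embedding)
        self.embed_calls = 0      # counts real embedding calls, for monitoring

    def get(self, doc_id, text):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        cached = self._store.get(doc_id)
        if cached and cached[0] == digest:
            return cached[1]      # unchanged document: reuse the stored vector
        self.embed_calls += 1
        embedding = self.embed_fn(text)
        self._store[doc_id] = (digest, embedding)
        return embedding


# Toy embedder so the example runs without a model.
cache = EmbeddingCache(embed_fn=lambda text: [float(len(text))])
cache.get("doc-1", "Refunds take 14 days.")
cache.get("doc-1", "Refunds take 14 days.")  # cache hit, no embedding call
cache.get("doc-1", "Refunds take 30 days.")  # content changed, re-embedded
```

Three lookups trigger two embedding calls here: one for the initial version and one for the edit.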

6. Permissions are part of retrieval, not a layer above it

In multi-tenant or permission-aware deployments (most enterprise RAG), the temptation is to retrieve from a shared vector store and filter results after the fact based on the calling user's permissions. This is wrong on two levels. First, the similarity search and its top-k slots are wasted on documents the user will never see, so the filtered result set can come back short or empty. Second, and more importantly, post-filtering leaks information through retrieval timing and through partial results that are visible in logs and analytics.

The right pattern is permission-aware retrieval: the user's permission scope is part of the query, and the vector store filters at the index level before similarity. In Postgres + pgvector, this is a WHERE clause on a tenant_id or role column. In Qdrant or Weaviate, it is a payload filter combined with the vector search.
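For the Postgres + pgvector case, a sketch of the query might look like the following. The chunks table and its columns are hypothetical; `<=>` is pgvector's cosine-distance operator, and the `%s` placeholders assume a psycopg-style driver.

```python
def build_scoped_query(tenant_id, query_embedding, k=3):
    """Return (sql, params) for a tenant-scoped nearest-neighbour search.

    Assumes a hypothetical table chunks(id, tenant_id, content, embedding).
    """
    sql = (
        "SELECT id, content "
        "FROM chunks "
        "WHERE tenant_id = %s "               # permission scope lives in the query
        "ORDER BY embedding <=> %s::vector "  # cosine distance to the query vector
        "LIMIT %s"
    )
    # pgvector accepts a vector as a text literal such as '[0.1,0.2,0.3]'.
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    return sql, (tenant_id, vector_literal, k)


sql, params = build_scoped_query("acme", [0.1, 0.2], k=3)
```

The pair would then be passed to something like cursor.execute(sql, params). Because the tenant ID is a bound parameter derived from the authenticated session rather than from user input, rows from other tenants are never candidates for the result set.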

Wrapping up

RAG is mostly an engineering discipline problem, not a model problem. Pick the simplest retrieval architecture that clears your eval bar, require citations, run a refusal layer, and gate every change against your golden set. The teams that get this right ship in 6-8 weeks. The teams that obsess over which LLM provider has the best benchmark numbers ship a brittle demo and rebuild it in month four.
