Almost every production AI failure we have audited at customers — hallucinations leaking into customer chat, voice agents going off-script, document extractors emitting malformed JSON — traces back to the same missing piece: there was no eval harness, or there was one but nobody ran it before deploys. The model that worked in the prompt playground started misbehaving the moment real users hit it, and the team had no way to know which prompt change broke what.
An LLM eval harness is the contract between your AI system and the business outcome it is supposed to deliver. It is also the only thing that makes shipping prompt and model changes safe. Here is what it actually consists of, written from the deployments where we have built one.
1. The eval harness is a test suite, not a benchmark
Public benchmarks like MMLU, HellaSwag, and GSM8K measure general-purpose model capability. They do not measure your system. A model that scores 92% on MMLU may still mishandle 30% of your billing queries because your billing queries depend on terminology, tone, and retrieval that the public benchmark does not contain.
An eval harness for your system is built from your data: 200 to 500 representative queries drawn from real user transcripts (or, before launch, your team's best simulation of them), each with the correct answer, the correct cited source, the correct tool to call, and a refusal flag if the query is out of scope. This is the golden set. Every prompt change, every model swap, every chunking tweak runs against the golden set before merging.
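One way to picture a golden-set entry is as a small typed record. The sketch below is illustrative only: the class name, field names, the `category` field, and the sample queries and source IDs are all assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GoldenExample:
    """One entry in the golden set (field names are hypothetical)."""
    query: str
    category: str                    # intent category, e.g. "billing"
    expected_answer: Optional[str]   # None when the system should refuse
    expected_source: Optional[str]   # ID of the document that holds the answer
    expected_tool: Optional[str]     # tool the agent should call, if any
    should_refuse: bool = False      # True for out-of-scope queries


def validate(example: GoldenExample) -> list[str]:
    """Return a list of problems with an entry; an empty list means usable."""
    problems = []
    if not example.query.strip():
        problems.append("empty query")
    if example.should_refuse and example.expected_answer is not None:
        problems.append("refusal entries should not carry an expected answer")
    if not example.should_refuse and example.expected_answer is None:
        problems.append("in-scope entries need an expected answer")
    return problems


# Made-up sample entries: one in-scope query and one that should be refused.
golden_set = [
    GoldenExample(
        query="How do I update my card?",
        category="billing",
        expected_answer="Go to Settings > Billing and choose Update card.",
        expected_source="kb/billing/update-card",
        expected_tool=None,
    ),
    GoldenExample(
        query="Write me a poem about my cat.",
        category="out_of_scope",
        expected_answer=None,
        expected_source=None,
        expected_tool=None,
        should_refuse=True,
    ),
]

for ex in golden_set:
    assert validate(ex) == [], validate(ex)
```

Validating entries when they are added keeps malformed examples from silently skewing the metrics later.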
2. Pick the right metrics — and report them per-category
A single quality score is almost always misleading because failures cluster in specific categories. The standard metric stack for a RAG chatbot looks like this, with each measured separately and reported by intent category:
- Retrieval recall@k — did the top k retrieved chunks include the source that contains the answer?
- Citation accuracy — does the cited source actually support the answer?
- Refusal correctness — did the system refuse out-of-scope queries and answer in-scope ones?
- Tool-call validity — did the agent call the right tool with the right arguments?
- Hallucination rate — fraction of answers that contain a fact not supported by any retrieved chunk
- Latency p50 / p95 — end-to-end response time per turn
- Cost per turn — token consumption × model price
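As a minimal sketch of per-category reporting, the snippet below computes recall@k for each result and averages it by intent category. The dictionary keys (`category`, `retrieved_ids`, `expected_id`) and the sample data are assumptions made for illustration.

```python
from collections import defaultdict


def recall_at_k(retrieved_ids, expected_id, k):
    """1.0 if the expected source is among the top k retrieved chunks, else 0.0."""
    return 1.0 if expected_id in retrieved_ids[:k] else 0.0


def report_by_category(results, k=5):
    """Average recall@k per intent category.

    `results` is a list of dicts with hypothetical keys:
    "category", "retrieved_ids" (ranked best-first), and "expected_id".
    """
    scores = defaultdict(list)
    for r in results:
        scores[r["category"]].append(
            recall_at_k(r["retrieved_ids"], r["expected_id"], k)
        )
    return {cat: sum(vals) / len(vals) for cat, vals in scores.items()}


# Made-up results: two billing queries and one shipping query.
results = [
    {"category": "billing", "retrieved_ids": ["a", "b", "c"], "expected_id": "b"},
    {"category": "billing", "retrieved_ids": ["d", "e", "f"], "expected_id": "z"},
    {"category": "shipping", "retrieved_ids": ["g", "h"], "expected_id": "g"},
]

report = report_by_category(results, k=3)
print(report)  # {'billing': 0.5, 'shipping': 1.0}
```

The overall recall here is two out of three, which looks tolerable, while the per-category view shows billing retrieval failing half the time.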
3. Auto-score where possible, human-score where it matters
Running every eval prompt past a human reviewer does not scale. But naive auto-scoring (e.g. exact-string match against the golden answer) misses paraphrases that are equally correct. The pragmatic split: use deterministic auto-scoring for retrieval recall, citation accuracy, tool-call validity, refusal correctness, latency, and cost, because these can be checked objectively against the golden set. For answer quality on open-ended generations, use an LLM-as-judge with a clear rubric, validated against a sample of human-scored examples.
An LLM judge is cheaper and faster than human review and, once validated, can agree with human scores more than 90% of the time on most tasks. Validation means: take 50 golden examples, have a human score them, have the judge LLM score them, and compute Cohen's kappa, which measures agreement corrected for chance. If kappa is above 0.7, ship the judge. If not, refine the rubric and re-validate until it is.
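The deterministic checks can be plain comparisons against the golden entry. The sketch below covers tool-call validity and refusal correctness; the `{"name": ..., "arguments": ...}` shape and the `lookup_invoice` tool are assumptions for illustration, not any particular provider's format.

```python
def tool_call_valid(actual, expected):
    """True when the tool name and arguments both match the golden entry.

    Both values are None (no tool call) or a dict shaped like
    {"name": str, "arguments": dict}. That shape is hypothetical.
    """
    if expected is None:
        return actual is None      # the agent should not have called a tool
    if actual is None:
        return False               # a tool call was expected but none was made
    return (
        actual["name"] == expected["name"]
        and actual["arguments"] == expected["arguments"]
    )


def refusal_correct(did_refuse, should_refuse):
    """True when the system refused exactly the queries it should refuse."""
    return did_refuse == should_refuse


expected_call = {"name": "lookup_invoice", "arguments": {"invoice_id": "INV-1"}}
wrong_args = {"name": "lookup_invoice", "arguments": {"invoice_id": "INV-2"}}

print(tool_call_valid(dict(expected_call), expected_call))  # True
print(tool_call_valid(wrong_args, expected_call))           # False
print(refusal_correct(did_refuse=False, should_refuse=True))  # False
```

Strict equality on arguments is a deliberate simplification; a real harness might normalise values (dates, casing) before comparing.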
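The kappa check is short enough to write by hand. This is a sketch using the standard two-rater formula, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance. The labels and sample scores below are made up.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Cohen's kappa for two raters labelling the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(human)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    expected = sum(
        (human_counts[label] / n) * (judge_counts[label] / n)
        for label in set(human) | set(judge)
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up scores for 10 examples; the raters disagree on two of them.
human = ["pass"] * 6 + ["fail"] * 4
judge = ["pass"] * 5 + ["fail"] + ["fail"] * 3 + ["pass"]

kappa = cohens_kappa(human, judge)
print(round(kappa, 2))  # 0.58
```

In this made-up sample the raters agree on 8 of 10 items, yet kappa is only about 0.58, below the 0.7 bar, because much of that agreement is what chance alone would produce.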
4. Gate every release on the harness
The eval harness only adds value if it actually blocks deploys. Wire it into CI: any pull request that changes a prompt, retrieval config, model selection, or tool definition triggers a full eval run. The run reports the delta from the previous version on every metric. If any metric regresses past its threshold (e.g. citation accuracy drops by more than 2 percentage points), the deploy is blocked.
This is the boring, unglamorous discipline that separates AI systems that ship reliably from AI systems that ship once and then quietly decay over months of "just one more prompt tweak". With CI-gated evals, every change has measured impact. Without them, every change is a roll of the dice.
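A release gate of this kind can be a small script that compares the candidate's metrics with the previous version's and returns a non-zero status on regression. The sketch below assumes higher-is-better metrics expressed in percent. The metric names, the sample scores, and every threshold other than the 2-point citation example are hypothetical.

```python
# Maximum allowed drop per metric, in percentage points.
# Only the 2-point citation threshold comes from the text; the rest are placeholders.
THRESHOLDS = {
    "citation_accuracy": 2.0,
    "recall_at_5": 2.0,
    "refusal_correctness": 1.0,
}


def find_regressions(baseline, candidate, thresholds):
    """Return {metric: drop} for every metric that fell by more than its threshold."""
    regressions = {}
    for metric, max_drop in thresholds.items():
        drop = baseline[metric] - candidate[metric]
        if drop > max_drop:
            regressions[metric] = round(drop, 2)
    return regressions


def gate(baseline, candidate, thresholds=THRESHOLDS):
    """Print each blocking regression and return a process exit code (0 = pass)."""
    regressions = find_regressions(baseline, candidate, thresholds)
    for metric, drop in regressions.items():
        print(f"BLOCKED: {metric} dropped {drop} points")
    return 1 if regressions else 0


# Made-up scores for the previous version and the pull request under test.
baseline = {"citation_accuracy": 94.0, "recall_at_5": 88.0, "refusal_correctness": 97.0}
candidate = {"citation_accuracy": 91.5, "recall_at_5": 87.0, "refusal_correctness": 97.0}

exit_code = gate(baseline, candidate)  # prints: BLOCKED: citation_accuracy dropped 2.5 points
```

In a CI job the return value would be passed to `sys.exit`, so a non-zero code fails the pipeline. Lower-is-better metrics such as latency and cost would need the comparison reversed.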
5. Run continuous evals on production traffic, not just CI
CI evals catch known regressions. Continuous production evals catch unknown ones: the shifts in user behaviour, corpus drift, or model-provider changes that you did not anticipate. Sample 5 to 10 percent of production traffic (or score all of it on low-volume systems), run it through the eval rubric (LLM-as-judge), and alert when category-level metrics shift more than two standard deviations from baseline. A typical setup includes:
- Daily evals on full production traffic for low-volume systems
- Sampled hourly evals on high-volume systems (chatbots handling 10k+ conversations/day)
- Per-customer eval slices for multi-tenant SaaS — catches when a single tenant's corpus shift breaks their experience
- Drift alerts that fire on metric shifts (not just absolute thresholds) so you catch slow degradation
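The sampling and drift-alert steps can be sketched as two small functions. Hash-based sampling is one possible way to pick a stable slice of traffic and is an assumption here, as are the function names, the conversation ID format, and the sample scores.

```python
import hashlib
from statistics import mean, stdev


def in_sample(conversation_id, rate=0.05):
    """Deterministically select roughly `rate` of conversations by hashing their ID."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def drift_alert(baseline_scores, current_score, max_sigma=2.0):
    """True when current_score is more than max_sigma standard deviations
    away from the mean of the baseline window."""
    if len(baseline_scores) < 2:
        raise ValueError("need at least two baseline points")
    mu = mean(baseline_scores)
    sigma = stdev(baseline_scores)
    if sigma == 0:
        return current_score != mu  # flat baseline: any change counts as drift
    return abs(current_score - mu) / sigma > max_sigma


# Made-up daily citation-accuracy scores for one intent category.
baseline = [0.91, 0.93, 0.92, 0.90, 0.94, 0.92, 0.91]

print(drift_alert(baseline, 0.88))  # True
print(drift_alert(baseline, 0.90))  # False
```

Because the alert is relative to the baseline's own spread, it can fire on a slow slide that never crosses a fixed absolute threshold. Hashing the conversation ID keeps the sample stable, so the same conversation is always in or out of it.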
6. Keep the golden set honest
The single biggest mistake teams make with eval harnesses is letting the golden set rot. New product features ship, the user behaviour distribution changes, and the eval set keeps testing what last quarter's product did. Six months in, the harness reports 99% on a set of queries nobody actually asks anymore.
Refresh the golden set on a cadence — quarterly is a reasonable default for most systems, monthly for fast-moving products. Sample new real-user queries (with PII redacted), have a human review them, add the high-signal ones to the golden set, retire the obsolete ones. The harness should evolve with the product, not lock it into a past version of itself.
Wrapping up
An eval harness is the load-bearing piece of production AI infrastructure. It is what lets you ship prompt changes confidently, swap models without dread, and prove to a regulator that your system did what you said it would. Build it in week one of the engagement, gate every release on it, and refresh it quarterly. The teams that do this ship AI that survives past the demo. The teams that skip it ship a brittle prototype and rebuild it six months later.
