Almost every failed LLM engagement I have audited has the same missing artifact: a real eval set. The team built prompts that worked in the playground, shipped them, and then learned about the regressions from customer complaints because there was no automated test gate between "prompt edit" and "production." In traditional software, this would be like shipping with no unit tests. The fix is straightforward and unglamorous — build an eval harness in week one, gate every prompt and model change on it, and treat it as the contract the system is held to.

I have built eval harnesses for healthcare RAG (98.4% citation accuracy on a [medical-inquiry deployment](/case-studies/medical-inquiry-system)), telco SMS bots (68% L1 deflection across 110k weekly conversations), financial-services document agents, EdTech adaptive interview agents, and voice agents handling 4,000 calls a day. The harness pattern is consistent across all of them. Below is the 101 version — what to build, what to measure, how to score, and when to escalate from LLM-as-judge to human review.

What an eval harness actually is

An eval harness is a test suite for an LLM-powered system. It takes a curated set of representative inputs (the eval set), runs them through the system, and scores the outputs against expected results. The scoring can be deterministic (string match, structural match), heuristic (LLM-as-judge against a rubric), or human (manual review of a sample). The harness runs in CI on every prompt, model, retrieval, or tool change, and a change that regresses on the harness does not merge.

The mental model: golden dataset = the contract. Eval harness = the CI test. Production observability = the runtime measurement. The three together turn an LLM system from a science experiment into a software product.
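As a minimal sketch of that loop, assuming the system under test can be called as a plain function from query to answer: the names `EvalCase`, `run_harness`, and `toy_system` are illustrative rather than taken from any eval platform, and exact match stands in for whatever scorer the case format calls for.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    query: str
    expected: str


def exact_match(output: str, expected: str) -> bool:
    """Deterministic scorer: case- and whitespace-insensitive string match."""
    return output.strip().lower() == expected.strip().lower()


def run_harness(
    system: Callable[[str], str],
    cases: list[EvalCase],
    scorer: Callable[[str, str], bool] = exact_match,
) -> dict:
    """Run every case through the system and score each output."""
    results = {}
    for case in cases:
        output = system(case.query)
        results[case.case_id] = scorer(output, case.expected)
    pass_rate = sum(results.values()) / len(cases) if cases else 0.0
    return {"pass_rate": pass_rate, "results": results}


def toy_system(query: str) -> str:
    """Stand-in for the real LLM pipeline (hypothetical)."""
    answers = {"What is the capital of France?": "Paris"}
    return answers.get(query, "I don't know")


cases = [
    EvalCase("c1", "What is the capital of France?", "paris"),
    EvalCase("c2", "What is the capital of Peru?", "Lima"),
]
report = run_harness(toy_system, cases)
print(report["pass_rate"], report["results"])
```

In a real build the `system` argument would wrap the full pipeline (retrieval, prompt, model call), so the harness exercises the same path production does.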

The three categories of eval sets

A production eval harness has three distinct dataset categories, each scored differently and used at different points in the development cycle.

Reference set — happy-path queries with known correct answers. Used to measure baseline accuracy. Typically 100-300 cases for a v1 build.
Golden set — the production contract. Curated by the customer's domain experts, signed off, frozen. Every release gates on this. Typically 200-500 cases.
Adversarial set — edge cases, ambiguous queries, safety-critical inputs, out-of-distribution queries. Used to measure refusal behavior and robustness. Typically 100-200 cases.

The reference set is for early development iteration. The golden set is what production releases gate on. The adversarial set ensures the system fails gracefully when it should. A team that builds only the reference set ships an agent that looks good in development and breaks in production; a team that builds only the adversarial set is over-indexed on edge cases and may not have a clear baseline.
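One way to keep the three sets from blurring together is to tag every result with its set and report pass rates per set, so a strong reference score cannot hide a weak adversarial one. A sketch, with a hypothetical `EvalSet` enum and made-up results:

```python
from collections import defaultdict
from enum import Enum


class EvalSet(Enum):
    REFERENCE = "reference"
    GOLDEN = "golden"
    ADVERSARIAL = "adversarial"


def pass_rate_by_set(results):
    """results: iterable of (EvalSet, passed) pairs. Returns {EvalSet: pass rate}."""
    totals = defaultdict(lambda: [0, 0])  # set -> [passed, total]
    for eval_set, passed in results:
        totals[eval_set][0] += int(passed)
        totals[eval_set][1] += 1
    return {s: passed / total for s, (passed, total) in totals.items()}


# Illustrative results only, not real measurements.
sample = [
    (EvalSet.REFERENCE, True),
    (EvalSet.REFERENCE, True),
    (EvalSet.GOLDEN, True),
    (EvalSet.GOLDEN, False),
    (EvalSet.ADVERSARIAL, False),
    (EvalSet.ADVERSARIAL, False),
]
rates = pass_rate_by_set(sample)
for eval_set, rate in rates.items():
    print(f"{eval_set.value}: {rate:.0%}")
```

A single blended number over these six cases would read 50%, which says nothing about the fact that the adversarial set is failing outright.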

The five eval scoring categories

Every eval case should be scored across multiple dimensions. Single-score eval ("is this answer correct, yes/no") is too coarse to surface where regressions happen. The five-dimension model we use:

Factual correctness — does the output match the ground truth? Scored with exact match, semantic match, or LLM-as-judge depending on the answer format.
Citation accuracy — for RAG, do the citations actually support the claim? Scored by retrieving the cited chunk and verifying it contains the supporting fact.
Refusal correctness — when the query is out of scope, safety-critical, or unanswerable, does the agent refuse cleanly? Scored against a binary expected-refusal flag per case.
Format conformance — for structured outputs (JSON, tool calls, specific response templates), does the output parse and conform? Scored deterministically with a schema validator.
Cost and latency — every eval case captures token cost and response latency so cost/latency regressions surface alongside accuracy regressions.
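The five dimensions can be sketched as one score record per case. This is an assumption-heavy illustration: `CaseScore` and the helper names are hypothetical, the citation check is a crude substring test standing in for real chunk verification, and the cost and latency numbers are placeholders.

```python
import json
from dataclasses import dataclass


@dataclass
class CaseScore:
    factual: bool
    citation: bool
    refusal: bool
    format_ok: bool
    cost_usd: float
    latency_ms: float


def format_conforms(raw_output: str, required_keys: set) -> bool:
    """Deterministic format check: output parses as a JSON object with the required keys."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def citation_supported(cited_chunk: str, supporting_fact: str) -> bool:
    """Crude stand-in: the cited chunk must literally contain the supporting fact."""
    return supporting_fact.lower() in cited_chunk.lower()


def refusal_correct(did_refuse: bool, expected_refusal: bool) -> bool:
    """Compare observed refusal behavior with the case's expected-refusal flag."""
    return did_refuse == expected_refusal


raw_output = '{"answer": "The warranty lasts 24 months.", "citations": ["policy-7"]}'
cited_chunk = "Section 4: the standard warranty period is 24 months from delivery."

score = CaseScore(
    factual=True,  # assumed to come from exact match, semantic match, or a judge
    citation=citation_supported(cited_chunk, "24 months"),
    refusal=refusal_correct(did_refuse=False, expected_refusal=False),
    format_ok=format_conforms(raw_output, {"answer", "citations"}),
    cost_usd=0.0021,  # placeholder
    latency_ms=840.0,  # placeholder
)
print(score)
```

Keeping the dimensions as separate fields is the point: a release can hold factual accuracy steady while breaking format conformance, and a single yes/no score would not show it.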

How to build the golden set

The golden set is the most load-bearing artifact in the entire engagement. Its quality directly determines whether the system can be trusted in production. Building it is unglamorous, expert-heavy work — and the temptation to shortcut it is the temptation that produces failed engagements.

Source the queries from real production traffic if available; otherwise from the customer's support tickets, FAQ logs, or domain expert interviews.
Have a domain expert (clinician for healthcare, lawyer for legal, ops manager for support) write the expected answer for each case — not the engineering team.
Categorize each case with metadata: query type, difficulty tier, expected refusal flag, safety category, expected response format.
Freeze the v1 golden set at 200-500 cases and only expand by deliberate quarterly review — not by individual engineers adding cases ad hoc.
Version-control the golden set in the same repo as the system code. Every change to the golden set requires a PR review.
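The metadata and version-control steps above lend themselves to a validated file format. A sketch, assuming the golden set is stored as JSONL in the repo; the field names and the `load_golden_set` helper are hypothetical choices, not a standard schema.

```python
import json

# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {
    "case_id": str,
    "query": str,
    "expected_answer": str,
    "query_type": str,
    "difficulty_tier": int,
    "expected_refusal": bool,
    "safety_category": str,
    "response_format": str,
    "author": str,  # the domain expert who wrote the expected answer
}


def validate_case(case: dict) -> list:
    """Return a list of human-readable problems; empty means the case is valid."""
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in case:
            errors.append(f"missing field: {name}")
        elif not isinstance(case[name], expected_type):
            errors.append(f"{name} should be {expected_type.__name__}")
    return errors


def load_golden_set(jsonl_text: str) -> list:
    """Parse JSONL, rejecting malformed cases and duplicate case ids."""
    cases, seen_ids = [], set()
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue
        case = json.loads(line)
        errors = validate_case(case)
        if case.get("case_id") in seen_ids:
            errors.append(f"duplicate case_id: {case['case_id']}")
        if errors:
            raise ValueError(f"line {line_no}: {'; '.join(errors)}")
        seen_ids.add(case["case_id"])
        cases.append(case)
    return cases


sample_line = json.dumps({
    "case_id": "g-001",
    "query": "Can I return an opened item?",
    "expected_answer": "Yes, within 30 days with a receipt.",
    "query_type": "policy",
    "difficulty_tier": 1,
    "expected_refusal": False,
    "safety_category": "none",
    "response_format": "text",
    "author": "ops-manager",
})
golden = load_golden_set(sample_line)
print(len(golden), "case(s) loaded")
```

Running the loader as a pre-merge check means a malformed or duplicated case fails the PR that introduces it, which supports the review rule on golden-set changes.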

LLM-as-judge versus human eval — when to use which

LLM-as-judge is the practice of using a separate LLM to score the output of the system under test. It scales — you can run a 500-case eval in three minutes for under a dollar. It is also noisy — judge models have their own biases, and naive LLM-as-judge gives misleadingly high accuracy on free-form outputs.

The decision rule we use:

Deterministic scoring (exact match, schema validation, citation chunk verification) — always preferred when the case structure allows.
LLM-as-judge with a rubric — for free-form factual answers where multiple correct phrasings exist. Anchor with 3-5 reference correct answers per case and a rubric the judge model scores against.
Human eval on a stratified sample — for safety-critical categories, ambiguous cases, and adversarial cases. Sample 20-50 cases per release, review by domain experts, flag disagreements with LLM-as-judge to recalibrate.
Production A/B with statistical lift measurement — for cases where ground truth is ambiguous and the right metric is downstream user behavior (completion rate, deflection rate, escalation rate).
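The human-review step in the rule above has two mechanical parts: drawing a stratified sample and comparing human labels with the judge's. A sketch of both, assuming each case carries a `safety_category` field and each label is a simple pass/fail boolean; the function names and data are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(cases: list, per_category: int, seed: int = 0) -> list:
    """Draw up to per_category cases from each safety category, reproducibly."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for case in cases:
        by_category[case["safety_category"]].append(case)
    sample = []
    for category in sorted(by_category):
        group = by_category[category]
        sample.extend(rng.sample(group, min(per_category, len(group))))
    return sample


def judge_human_agreement(judge: dict, human: dict):
    """Return (agreement rate, disagreeing case ids) over cases both sides labeled."""
    shared = sorted(judge.keys() & human.keys())
    if not shared:
        return None, []
    disagreements = [cid for cid in shared if judge[cid] != human[cid]]
    return 1 - len(disagreements) / len(shared), disagreements


# Illustrative data only.
cases = [
    {"case_id": f"c{i}", "safety_category": "dosage" if i % 2 else "general"}
    for i in range(10)
]
review_batch = stratified_sample(cases, per_category=2)

judge_labels = {"c1": True, "c2": True, "c3": False, "c4": True}
human_labels = {"c1": True, "c2": False, "c3": False, "c4": True}
agreement, flagged = judge_human_agreement(judge_labels, human_labels)
print(f"agreement={agreement:.2f}, recalibrate on: {flagged}")
```

The flagged ids are the useful output: each one is a case where the rubric or the reference answers may need tightening before the judge can be trusted on that category.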

Gating in CI — the merge-blocking pattern

An eval harness that is not gated in CI is a dashboard nobody reads. The merge-blocking discipline is what makes the harness load-bearing. Every pull request that touches a prompt (system prompt included), a model selection, a retrieval config, or a tool definition triggers a CI job that runs the harness and reports the delta against the prior release baseline.

PR-triggered eval on reference + golden sets: typically runs in 3-10 minutes and blocks merge on threshold regressions.
Nightly eval on the full adversarial set: slower (10-30 minutes); regressions are reported in the comment threads of the PRs merged that day.
Weekly eval on a production sample: 1-5% of recent production traffic, scored with the same harness, to catch eval-vs-prod drift.
Quarterly golden-set expansion review: the customer's domain experts review the latest production sample and add cases that surface new categories.
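The merge-blocking check itself can be very small. A sketch, assuming each run produces one aggregate score per dimension and that the baseline is stored alongside the code; the metric names and regression budgets below are hypothetical and would be agreed per engagement.

```python
# Hypothetical per-metric budgets: the largest drop versus baseline that still merges.
MAX_REGRESSION = {"factual": 0.01, "citation": 0.01, "refusal": 0.0, "format": 0.0}
TOLERANCE = 1e-9  # absorbs floating-point noise in the subtraction


def gate(baseline: dict, current: dict, max_regression: dict = MAX_REGRESSION) -> list:
    """Return one message per metric that regressed beyond its budget."""
    failures = []
    for metric, allowed in max_regression.items():
        drop = baseline[metric] - current[metric]
        if drop > allowed + TOLERANCE:
            failures.append(
                f"{metric}: {baseline[metric]:.3f} -> {current[metric]:.3f} "
                f"(allowed drop {allowed:.3f})"
            )
    return failures


# Illustrative numbers only.
baseline = {"factual": 0.95, "citation": 0.98, "refusal": 1.0, "format": 1.0}
candidate = {"factual": 0.90, "citation": 0.98, "refusal": 1.0, "format": 1.0}

failures = gate(baseline, candidate)
for message in failures:
    print("REGRESSION", message)
exit_code = 1 if failures else 0  # a CI job would pass this to sys.exit
```

Returning messages rather than a bare boolean keeps the CI log useful: the PR author sees which dimension moved and by how much without opening the eval platform.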

Eval tooling — what to use in 2026

The eval-platform tooling has matured fast in 2026. The defaults we recommend, with the decision criteria for each:

Braintrust: the strongest all-in-one for teams that want eval + traces + experiment tracking in one platform. Hosted SaaS; works for most non-regulated deployments.
Langfuse: open-source, self-hostable. The default when the customer needs eval data inside their own VPC for data-residency or compliance reasons (GDPR, HIPAA, regulated finance).
Arize Phoenix: best fit when the customer already runs Arize for traditional ML observability and wants the LLM eval in the same pane of glass.
Promptfoo: lightweight, CLI-driven, good for teams that want eval-as-code with minimal platform overhead. Best for v0/v1 builds before scaling up.
Custom in-repo Python eval: for teams that want zero external dependencies and are willing to build the experiment-tracking UI themselves. Works fine, scales less well.

Common eval-harness anti-patterns

Patterns I see in failed eval harnesses, in rough order of how often they appear:

Eval set written by the engineering team rather than domain experts. The set tests what the engineers thought the system should do, not what the users actually ask.
Single-dimension scoring (correctness only). Misses citation, refusal, format, and cost regressions until they hit production.
Eval run only at the end of the sprint instead of on every PR. Regressions accumulate and are hard to attribute to a specific change.
LLM-as-judge with no rubric and no reference answers. The judge model rewards plausible-sounding outputs over correct ones.
No adversarial set. The system has never been tested on inputs designed to break it; the first such input comes from a hostile user in production.
Eval set frozen at v1 and never updated. Production traffic drifts away from the set, the harness keeps passing, and the system silently regresses.

What success looks like

A team running a real eval harness has the following in place in week one of the engagement: 200+ golden cases scored on five dimensions, CI gating on every prompt/model PR, weekly scoring of a production sample with the same harness, and a quarterly review process to expand the golden set. The team can answer the question "what was our accuracy two releases ago, and which change caused the delta?" in 30 seconds by querying the eval platform. See our [shipping RAG in production post](/blog/shipping-rag-in-production) for how the harness anchors the rest of the RAG architecture.
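For teams without an eval platform, the "two releases ago" question can be answered from a plain run history. A sketch, assuming one record per release with the change that shipped in it; the record shape, helper names, and numbers are all hypothetical.

```python
# Illustrative history, oldest first. Not real measurements.
history = [
    {"release": "r1", "change": "baseline", "accuracy": 0.91},
    {"release": "r2", "change": "rewrite system prompt", "accuracy": 0.93},
    {"release": "r3", "change": "lower retrieval top_k", "accuracy": 0.88},
    {"release": "r4", "change": "add reranker", "accuracy": 0.89},
]


def accuracy_n_releases_ago(runs: list, n: int) -> float:
    """n=0 is the latest release, n=2 is two releases before it."""
    return runs[-1 - n]["accuracy"]


def largest_drop(runs: list):
    """Return (change description, delta) for the release with the biggest accuracy drop."""
    deltas = [
        (current["accuracy"] - previous["accuracy"], current["change"])
        for previous, current in zip(runs, runs[1:])
    ]
    delta, change = min(deltas, key=lambda pair: pair[0])
    return change, delta


two_ago = accuracy_n_releases_ago(history, 2)
culprit, delta = largest_drop(history)
print(f"two releases ago: {two_ago:.2f}; biggest drop: {culprit} ({delta:+.2f})")
```

This only attributes a delta cleanly when each release carries one change, which is another argument for running the harness on every PR rather than per sprint.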

The same team can hand the eval harness over to the customer's engineering team at the end of the engagement and the customer can keep running it without the vendor. That is what "eval-first delivery" actually means as a deliverable — not a slide saying "we believe in evals" but a working CI gate, a versioned dataset, and a documented operating procedure.

Wrapping up

The eval harness is the single most leveraged artifact in an LLM engagement. It anchors every architectural decision, gates every release, and makes the system handoff-able. Teams that skip it ship demos. Teams that build it ship products. The two are different categories of engagement and the eval harness is the dividing line.

If you are scoping an LLM build and want a written eval-set proposal — categories, case counts, scoring rubric, CI integration — [book a discovery call](/contact-us). We will return a one-pager inside 72 hours: the golden set structure for your use case, the platform recommendation, and the gating thresholds we would commit to in writing.

Tagged: LLM evaluation, eval harness, golden dataset, LLM as judge, RAG eval, AI testing

LLM Evaluation Harness 101: How to Test an LLM Before Your Users Do
