The most common technical question I get on Aiinfox discovery calls in 2026 is some variant of "should we use RAG or fine-tune the model?" The honest answer almost always starts with "you probably want RAG, and here is why" — and the cases where fine-tuning is the right call are narrower and more specific than the industry conversation suggests. Of the roughly 50 production AI systems we have shipped, around 40 use RAG as the core architecture, around 8 use a fine-tuned model, and a small handful combine both in deliberate ways.

What follows is the honest cost math we run when a client asks: per-token costs, infrastructure costs, the eval cycle that dominates the engineering bill, the latency tradeoffs that surface in production, and the hybrid patterns that actually combine the two. Most of the conventional wisdom on this topic is either out of date (citing 2023 model prices), oversimplified ("fine-tune is cheaper at scale"), or selling something. The numbers below are from production deployments we have shipped or audited.

1. The pricing landscape in 2026 — what changed

Foundation-model pricing in 2026 is roughly 3-5x cheaper per token than in 2023 for equivalent-quality output, driven by improved model efficiency, prompt caching, and aggressive provider competition. Claude Sonnet, GPT-4o, and Llama 3 70B (hosted) all sit in the $1-5 per million input tokens range, with output tokens at $3-15. Prompt caching cuts the cached portion of the input cost by 60-90% for repeated system prompts.
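To make the caching discount concrete, here is a minimal sketch of per-request input cost. The function name and parameters are illustrative, not any provider's API, and the example price and discount are just points picked from inside the ranges quoted above.

```python
def input_cost_usd(input_tokens, price_per_million, cached_fraction=0.0, cache_discount=0.0):
    """Input-token cost of one request, discounting only the cached portion.

    cached_fraction: share of the input served from the prompt cache (0..1).
    cache_discount:  price reduction on cached tokens (0.9 means 90% cheaper).
    """
    if not (0.0 <= cached_fraction <= 1.0 and 0.0 <= cache_discount <= 1.0):
        raise ValueError("cached_fraction and cache_discount must be in [0, 1]")
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    effective_tokens = uncached + cached * (1.0 - cache_discount)
    return effective_tokens * price_per_million / 1_000_000


# Illustrative numbers only: 3,000 input tokens at $3 per million.
no_cache = input_cost_usd(3_000, 3.0)
half_cached = input_cost_usd(3_000, 3.0, cached_fraction=0.5, cache_discount=0.9)
print(f"no cache: ${no_cache:.5f}  half cached at 90% off: ${half_cached:.5f}")
```

The point of the sketch is that the discount applies only to the cached share, so the overall saving depends on how much of each request is a repeated prefix.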

Fine-tuning compute, by comparison, has not gotten dramatically cheaper. A LoRA fine-tune of Llama 3 8B on 5,000 examples runs $100-500 in compute. A full fine-tune of a 70B model runs $5,000-30,000 depending on dataset size and provider. But compute is rarely the binding constraint — the engineering and data costs that surround a fine-tune dominate the total cost of ownership, and those have not gotten cheaper.

2. RAG cost decomposition — where the money actually goes

A production RAG system's running cost decomposes into roughly four buckets: LLM tokens, vector store, embedding generation, and observability. For a mid-volume RAG system (100k queries/day, average 3k context tokens per query), the monthly numbers look like:

LLM tokens: $3,000-9,000/month depending on model choice and prompt caching efficiency
Vector store: $200-800/month for pgvector on a managed Postgres, or $500-2,500 for Pinecone / Qdrant Cloud at this scale
Embedding generation: $50-200/month for incremental embedding of new content (the bulk cost is one-time at corpus ingestion)
Observability and logging: $200-600/month for Langfuse, Datadog, or similar
Total: roughly $3,500-12,000/month at this volume
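The total is just the sum of the bucket ranges. A small sketch of that arithmetic, using the figures from the list above; the vector-store range spans both the pgvector and the Pinecone / Qdrant Cloud options, and the dictionary keys are my own labels.

```python
# (low, high) monthly USD per bucket, taken from the list above.
RAG_MONTHLY_USD = {
    "llm_tokens": (3_000, 9_000),
    "vector_store": (200, 2_500),  # pgvector low end through Pinecone / Qdrant Cloud high end
    "embedding_generation": (50, 200),
    "observability": (200, 600),
}


def total_range(buckets):
    """Sum the low ends and the high ends separately."""
    low = sum(lo for lo, _ in buckets.values())
    high = sum(hi for _, hi in buckets.values())
    return low, high


low, high = total_range(RAG_MONTHLY_USD)
llm_low, llm_high = RAG_MONTHLY_USD["llm_tokens"]
print(f"total: ${low:,}-${high:,}/month")
print(f"LLM share: {llm_low / low:.0%} at the low end, {llm_high / high:.0%} at the high end")
```

Summing the ends gives $3,450-12,300, which rounds to the total quoted above, with LLM tokens accounting for well over two-thirds of the bill at either end.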

LLM tokens are the dominant cost and the most controllable lever. Three disciplines do most of the work: top-3 retrieval after re-ranking instead of top-10 unranked, prompt caching for the system prompt and tool definitions, and a right-sized model that clears the eval bar without going to the top tier. Together these typically cut LLM costs by 60-80% versus the naive default configuration. See the [RAG hallucination post](/blog/rag-hallucination-rates-what-moves-the-needle) for the structural levers behind this.
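A rough sketch of how two of these levers show up in billable input tokens per query. The token sizes (an 800-token system prompt, 400-token chunks) and the 90% cache discount are assumptions chosen for illustration, not measurements from a deployment, and the sketch ignores output tokens and model choice.

```python
def billable_input_tokens(system_tokens, chunk_tokens, top_k, cache_discount=0.0):
    """Effective input tokens per query.

    The system prompt is assumed to be the only cached part; retrieved
    chunks differ per query, so they are billed at the full rate.
    """
    return system_tokens * (1.0 - cache_discount) + chunk_tokens * top_k


# Hypothetical sizes: 800-token system prompt, 400-token chunks.
naive = billable_input_tokens(800, 400, top_k=10)
tuned = billable_input_tokens(800, 400, top_k=3, cache_discount=0.9)
reduction = 1.0 - tuned / naive
print(f"naive: {naive:.0f} tokens, tuned: {tuned:.0f} tokens, reduction: {reduction:.0%}")
```

Under these assumptions, trimming retrieval and caching the system prompt already lands inside the 60-80% band before any change of model.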

3. Fine-tuning cost decomposition — where the money actually goes

A fine-tuning engagement decomposes very differently. Compute is usually the smallest line item; engineering is the largest:

Data curation: 2-6 weeks of senior engineering and domain-expert review. The bulk of the project cost and the largest source of project risk. Typical: $25,000-80,000 in fully-loaded engineering and domain-review hours.
Eval harness construction: 1-2 weeks of senior engineering. You cannot fine-tune without a measurement instrument. Typical: $10,000-25,000.
Training pipeline and experiment tracking: 1 week of MLOps. Hyperparameter sweeps, checkpointing, dataset versioning. Typical: $8,000-15,000.
Compute for training runs: $500-30,000 depending on model size and number of runs. Often the smallest line item.
Deployment and serving: 1-2 weeks of MLOps. vLLM or TGI, throughput tuning, integration. Typical: $10,000-25,000.
Ongoing refresh: every 3-6 months as the domain drifts, a smaller version of all of the above. Typical: $15,000-40,000/quarter.
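Adding up the line items above gives the size of the initial build and shows how small the compute share is. The sketch below uses the listed ranges; the key names are my own, and the number of refresh cycles in the first year is left as a parameter because it depends on how fast the domain drifts.

```python
# (low, high) USD per line item, taken from the list above.
FINE_TUNE_BUILD_USD = {
    "data_curation": (25_000, 80_000),
    "eval_harness": (10_000, 25_000),
    "training_pipeline": (8_000, 15_000),
    "training_compute": (500, 30_000),
    "deployment_and_serving": (10_000, 25_000),
}
REFRESH_USD = (15_000, 40_000)  # per refresh cycle


def initial_build_range(items=FINE_TUNE_BUILD_USD):
    """Sum the low ends and the high ends of the one-off build items."""
    return (sum(lo for lo, _ in items.values()), sum(hi for _, hi in items.values()))


def first_year_range(refreshes):
    """Initial build plus an assumed number of refresh cycles in year one."""
    low, high = initial_build_range()
    return (low + refreshes * REFRESH_USD[0], high + refreshes * REFRESH_USD[1])


low, high = initial_build_range()
print(f"initial build: ${low:,}-${high:,}")
print(f"compute share at the low end: {FINE_TUNE_BUILD_USD['training_compute'][0] / low:.1%}")
print(f"first year with 2 refreshes: ${first_year_range(2)[0]:,}-${first_year_range(2)[1]:,}")
```

The initial build sums to $53,500-175,000, and with two refresh cycles the first year comes to $83,500-255,000.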

4. Latency — RAG adds milliseconds, fine-tuning subtracts them

RAG introduces retrieval-time latency: typically 20-80ms for hybrid retrieval, plus 10-30ms for re-ranking, plus a small amount of additional prompt processing for the retrieved context. End-to-end, RAG typically adds 50-150ms to first-token latency compared to a non-RAG call. For most use cases this is irrelevant; for voice agents, it can matter.

Fine-tuned models, especially smaller ones used as distilled replacements for larger models, can dramatically reduce latency. A fine-tuned Llama 3 8B running on vLLM on a single A100 can deliver first-token latency in the 100-200ms range versus 300-500ms for a hosted top-tier model. For latency-critical use cases — voice agents, sub-second interactive applications — fine-tuning a smaller model on outputs from a larger model is one of the most reliable latency wins. See the [voice agents under one second post](/blog/voice-agents-sub-second-latency) for the full latency budget.
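A minimal sketch of the first-token latency arithmetic, using the ranges quoted in this section. It assumes the components simply add and that the quoted model latencies exclude retrieval; real pipelines can overlap some of this work, so treat the output as a rough budget, not a benchmark.

```python
# (low, high) milliseconds, taken from the ranges quoted in this section.
HOSTED_TOP_TIER_MS = (300, 500)
FINE_TUNED_8B_MS = (100, 200)
RAG_OVERHEAD_MS = (50, 150)


def add_ranges(*ranges):
    """Add latency ranges end to end (assumes strictly sequential stages)."""
    return (sum(lo for lo, _ in ranges), sum(hi for _, hi in ranges))


hosted_with_rag = add_ranges(HOSTED_TOP_TIER_MS, RAG_OVERHEAD_MS)
small_with_rag = add_ranges(FINE_TUNED_8B_MS, RAG_OVERHEAD_MS)
print(f"hosted top-tier + RAG: {hosted_with_rag[0]}-{hosted_with_rag[1]} ms")
print(f"fine-tuned 8B + RAG:   {small_with_rag[0]}-{small_with_rag[1]} ms")
```

On these figures, a hosted top-tier model with retrieval sits at 350-650ms to first token, while a fine-tuned 8B with the same retrieval step sits at 150-350ms.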

5. When to pick RAG — the default case

RAG is the right call when the system needs to ground its outputs in a body of knowledge that changes — and "changes" includes adding new documents, updating existing ones, or scoping to a specific tenant's data in a multi-tenant deployment. The RAG architecture handles all of these naturally; the fine-tune architecture handles none of them without retraining.

Customer support knowledge bases — RAG. Knowledge changes weekly.
Medical inquiry agents grounded in clinical guidelines — RAG. Guidelines update, and the system needs to cite them.
Legal research grounded in case law — RAG. Citations need to be verifiable to the underlying source.
Internal enterprise search and copilots — RAG. Per-tenant data, per-permission retrieval.
Documentation Q&A — RAG. The docs are the source of truth and they update.
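Per-tenant scoping is the part a fine-tune cannot do without retraining, so it is worth seeing how little it takes on the retrieval side. This is a toy in-memory sketch, not our production retriever: the `Chunk` type, the tiny hand-written embeddings, and the brute-force cosine ranking are all illustrative stand-ins for a real vector store with a metadata filter.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    tenant_id: str
    text: str
    embedding: tuple[float, ...]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_embedding, chunks, tenant_id, top_k=3):
    """Filter to the caller's tenant first, then rank by similarity."""
    candidates = [c for c in chunks if c.tenant_id == tenant_id]
    ranked = sorted(candidates, key=lambda c: cosine(query_embedding, c.embedding), reverse=True)
    return ranked[:top_k]


corpus = [
    Chunk("acme", "Acme refund policy", (1.0, 0.0)),
    Chunk("acme", "Acme shipping times", (0.0, 1.0)),
    Chunk("globex", "Globex refund policy", (1.0, 0.1)),
]
for chunk in retrieve((1.0, 0.0), corpus, tenant_id="acme", top_k=1):
    print(chunk.text)
```

Filtering before ranking is the design choice that matters: another tenant's chunk never enters the candidate set, however similar it is to the query.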

Our 98.4% citation accuracy medical-inquiry deployment is RAG. Our legal-research agent is RAG. Our telco support agent handling 110k+ conversations a week is RAG with tool calls. The pattern is consistent: when the question is "what do my documents say", the answer is RAG.

6. When to pick fine-tuning — the four scenarios that earn it

Fine-tuning is the right call in four scenarios. Outside these, the engagement is almost always better served by RAG plus careful prompting.

Scenario A: Structured output the base model cannot reliably produce

If the system needs to emit a specific JSON schema, a specific terminology, or a specific format consistently — and prompting alone gets to 92% but not 99% — fine-tuning closes the gap. Document extractors are the classic case. The base model can extract invoice fields at 92% accuracy; fine-tuning on 500-2,000 labelled invoices lifts it to 99.5% with reliably-shaped JSON.
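Whether prompting "gets to 92% but not 99%" is something you measure, not something you guess. A minimal sketch of a conformance check for extractor output follows; the three invoice fields and their types are hypothetical, and a real harness would also compare the extracted values against labels, not only the shape.

```python
import json

# Hypothetical schema: field name -> allowed Python types after JSON parsing.
REQUIRED_FIELDS = {
    "invoice_number": (str,),
    "total": (int, float),
    "currency": (str,),
}


def is_conformant(raw_output):
    """True if the output parses as a JSON object with every required field
    present and of an allowed type. Uses exact type matching so that JSON
    booleans are not accepted as numbers."""
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    return all(
        name in obj and type(obj[name]) in allowed
        for name, allowed in REQUIRED_FIELDS.items()
    )


def conformance_rate(raw_outputs):
    """Share of model outputs that pass the shape check."""
    if not raw_outputs:
        raise ValueError("need at least one output")
    return sum(is_conformant(o) for o in raw_outputs) / len(raw_outputs)


samples = [
    '{"invoice_number": "INV-1", "total": 120.5, "currency": "EUR"}',
    '{"invoice_number": "INV-2", "currency": "USD"}',  # missing "total"
    'Sure! Here is the JSON you asked for: {...}',      # not valid JSON
]
print(f"conformance: {conformance_rate(samples):.0%}")
```

Running the same check before and after a fine-tune, on the same held-out set, is what turns the gap into a number you can put a price on.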

Scenario B: Cost or latency requires a smaller distilled model

When the top-tier model produces excellent answers but at unit economics that do not work, fine-tuning a smaller open-weight model on outputs from the larger model — distillation — can give 90-95% of the quality at 1/20th the cost and 1/5th the latency. This is the second-most-common fine-tuning win we ship.
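The data side of distillation is simple to sketch: run the larger model over representative prompts and keep its outputs as training targets for the smaller one. In the sketch below, `teacher` is a stand-in callable for whatever client calls the larger model, and the prompt/completion record shape is an assumption; the exact training format depends on the fine-tuning framework you use.

```python
import json
from typing import Callable, Iterable


def build_distillation_records(prompts: Iterable[str], teacher: Callable[[str], str]) -> list[dict]:
    """Collect teacher outputs as supervised targets for a smaller model.

    Empty teacher outputs are dropped; a real pipeline would also filter
    the outputs against the eval harness before training on them.
    """
    records = []
    for prompt in prompts:
        completion = teacher(prompt)
        if not completion.strip():
            continue
        records.append({"prompt": prompt, "completion": completion})
    return records


def to_jsonl(records: list[dict]) -> str:
    """Serialize records one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def fake_teacher(prompt: str) -> str:
    """Stand-in for a call to the larger model, for illustration only."""
    return "" if "skip" in prompt else f"answer to: {prompt}"


records = build_distillation_records(["q1", "skip me", "q2"], fake_teacher)
print(to_jsonl(records))
```

Most of the real effort sits in choosing prompts that match production traffic and in filtering weak teacher outputs, which is the data-curation cost discussed earlier.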

Scenario C: Data residency requires self-hosted with no hosted-API option

For deployments that require zero customer data egress — common in healthcare, defence, regulated finance, or specific EU clients — the system must be self-hosted. The available open-weight base models without fine-tuning lag the hosted top-tier on specialised tasks, so fine-tuning is often necessary to close the quality gap.

Scenario D: Persona, voice, or tone that prompting cannot reliably maintain

For consumer-facing applications where the AI's voice is part of the product — adaptive tutors, character agents, brand chatbots — prompting alone produces drift across long conversations. Fine-tuning on 1,000-5,000 example dialogues locks in the persona. Our [Mockinto deployment](/case-studies/interview-agent) — the adaptive AI interviewer that lifted user completion by 47% — uses fine-tuning to maintain the interviewer persona across 30-turn conversations where the base model would drift.

7. Hybrid patterns — when both make sense together

The cases where RAG and fine-tuning combine deliberately are narrow but high-value. The most common hybrid pattern: a fine-tuned model for tone, structure, and refusal behaviour, with RAG retrieval providing the facts at inference time. The fine-tune handles "how to answer"; the retrieval handles "what to answer about".

Concrete example: a healthcare clinical assistant fine-tuned to maintain a measured, refusal-prone tone appropriate for clinical context, with RAG retrieval providing the current formulary, current guideline, and current institutional protocol. The fine-tune gives the system the voice and the safety behaviour. The RAG gives it the facts that change every quarter as the formulary updates. Neither approach alone would deliver the production system.
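The division of labour can be sketched in a few lines. Here `retrieve` and `generate` are stand-in callables for the retrieval layer and the fine-tuned model, and the prompt layout and refusal text are illustrative assumptions, not the format used in any specific deployment.

```python
from typing import Callable, Sequence

REFUSAL = "I can't answer that from the current guidance."  # illustrative wording


def answer(
    question: str,
    retrieve: Callable[[str], Sequence[str]],
    generate: Callable[[str], str],
) -> str:
    """RAG supplies the facts; the fine-tuned model supplies tone and structure."""
    passages = retrieve(question)
    if not passages:
        # Nothing to ground the answer in: refuse instead of guessing.
        return REFUSAL
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    prompt = (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the context and cite passage numbers."
    )
    return generate(prompt)


# Stubs standing in for the retriever and the fine-tuned model.
def stub_retrieve(q):
    return ["Drug X: max 40 mg/day (formulary, current quarter)"] if "Drug X" in q else []


def stub_generate(prompt):
    return f"(model output for a prompt of {len(prompt)} characters)"


print(answer("What is the max daily dose of Drug X?", stub_retrieve, stub_generate))
print(answer("What is the dose of Drug Y?", stub_retrieve, stub_generate))
```

When the formulary changes, only the retrieval index is updated; the fine-tuned weights that carry the tone and refusal behaviour stay as they are.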

8. The cost-per-incremental-accuracy-point math

The most useful framing for the RAG-versus-fine-tune decision is cost-per-incremental-accuracy-point. If your current system is at 92% accuracy and you need 97% to pass the production bar, the question is which lever gets you there for less.

Improving retrieval (hybrid retrieval, better chunking, re-ranking): typically 5-15 percentage points of recall lift, at 1-2 weeks of engineering. The cheapest lever.
Tightening the citation requirement and refusal threshold: typically 3-8 percentage points of accuracy lift on grounded responses, at 1 week of engineering.
Upgrading the model from mid-tier to top-tier: typically 1-3 percentage points of lift on a properly-engineered RAG system, at materially higher inference cost.
Fine-tuning a model on domain data: typically 5-15 percentage points of lift on specific structured-output or persona tasks, at $60,000-180,000 of engagement cost.

The honest order of operations: exhaust the RAG levers first. The model and chunking changes get you most of the way for most use cases. Fine-tuning earns its cost only when the gap is structural (Scenarios A-D above) and the eval set proves the cheaper levers cannot close it.
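A rough sketch of the cost-per-point comparison. The fine-tuning figures come from the list above; the $10,000-per-engineering-week figure for the retrieval lever is an assumption, loosely based on the eval-harness line item (1-2 weeks for $10,000-25,000). The lifts are also measured on different metrics (recall versus task accuracy), so read the output as an order-of-magnitude comparison only.

```python
def cost_per_point(cost_range_usd, lift_range_points):
    """Best and worst case USD per percentage point of lift.

    Best case pairs the lowest cost with the highest lift;
    worst case pairs the highest cost with the lowest lift.
    """
    cost_low, cost_high = cost_range_usd
    lift_low, lift_high = lift_range_points
    if lift_low <= 0:
        raise ValueError("lift must be positive")
    return cost_low / lift_high, cost_high / lift_low


# Retrieval improvements: 1-2 weeks at an ASSUMED $10,000 per week, 5-15 points.
retrieval = cost_per_point((10_000, 20_000), (5, 15))
# Fine-tuning: $60,000-180,000 engagement, 5-15 points.
fine_tune = cost_per_point((60_000, 180_000), (5, 15))

print(f"retrieval: ${retrieval[0]:,.0f}-${retrieval[1]:,.0f} per point")
print(f"fine-tune: ${fine_tune[0]:,.0f}-${fine_tune[1]:,.0f} per point")
```

Under these assumptions the best case for fine-tuning costs about the same per point as the worst case for retrieval work, which is the arithmetic behind exhausting the RAG levers first.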

9. The decision tree, condensed

Does the system need to ground its answers in documents that change? → RAG.
Does the system need to know per-tenant data in a multi-tenant deployment? → RAG.
Does the system need a specific structured output the base model cannot reliably produce? → Fine-tune.
Does the system need a smaller, cheaper, faster model with quality close to the top-tier? → Distill via fine-tuning.
Does the system require self-hosted with zero data egress? → Open-weight + fine-tune (usually).
Does the system need a consistent persona across long conversations? → Fine-tune.
Does the system need both grounded facts and a maintained tone? → Hybrid: fine-tune for tone, RAG for facts.
If none of the above, but the base model is not good enough? → Improve prompting, retrieval, and tool use first.
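The tree is mechanical enough to write down as code. This sketch mirrors the questions above one for one; the field names are my own, and treating "grounded facts plus a maintained tone" as the combination of changing documents and a persona requirement is a simplification.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    changing_documents: bool = False
    per_tenant_data: bool = False
    strict_structured_output: bool = False
    needs_smaller_faster_model: bool = False
    zero_data_egress: bool = False
    consistent_persona: bool = False


def recommend(req: Requirements) -> list[str]:
    """Return every recommendation whose question is answered 'yes', in tree order."""
    recs: list[str] = []

    def add(rec: str) -> None:
        if rec not in recs:  # keep order, drop duplicates
            recs.append(rec)

    if req.changing_documents or req.per_tenant_data:
        add("RAG")
    if req.strict_structured_output:
        add("Fine-tune")
    if req.needs_smaller_faster_model:
        add("Distill via fine-tuning")
    if req.zero_data_egress:
        add("Open-weight + fine-tune (usually)")
    if req.consistent_persona:
        add("Fine-tune")
    if req.changing_documents and req.consistent_persona:
        add("Hybrid: fine-tune for tone, RAG for facts")
    if not recs:
        add("Improve prompting, retrieval, and tool use first")
    return recs


print(recommend(Requirements(changing_documents=True)))
print(recommend(Requirements(changing_documents=True, consistent_persona=True)))
print(recommend(Requirements()))
```

A real scoping conversation weighs these answers against budget and eval results, so the function is a checklist, not a verdict.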

Wrapping up

RAG is the default for production AI in 2026 because it handles the most common production constraint — that the knowledge changes — without retraining. Fine-tuning is the right call in four well-defined scenarios, mostly involving structured output, distillation, self-hosting, or persona consistency. The hybrid pattern is real but narrow. Most engagements are best served by exhausting the RAG levers before considering fine-tuning, because the cost-per-incremental-accuracy-point math favours RAG for most use cases.

If you are deciding between RAG and fine-tuning on a specific build, and you want a 30-minute conversation that runs the decision tree on your actual constraints rather than recites principles, [book a discovery call](/contact-us). We will tell you on the call which architecture is the right fit and what the engagement looks like, and follow up with a fixed-price scope inside 72 hours.

Tagged: RAG vs fine-tuning, RAG cost, fine-tuning cost, LLM cost optimization, hybrid RAG fine-tune, production AI cost

RAG vs Fine-Tuning in 2026: Cost, Latency, and When to Pick Which
