Fine-tuning is the most over-prescribed solution in production AI. Teams hear about it, assume it is the right answer because the base model is not good enough, spend two months curating data and running training, and end up with a model that performs no better than careful prompting plus retrieval would have delivered in a week. We have fine-tuned 12 models across healthcare, legal, EdTech, and finance for clients — and the honest answer is that fine-tuning is the right call about a third of the time we are asked for it.

Here is the actual decision tree we run when a client asks for fine-tuning, and the four scenarios where it genuinely pays off versus the cases where prompting, RAG, or model selection is the better answer.

When fine-tuning is the wrong answer

The most common request: "the foundation model does not answer questions correctly about our domain — we need to fine-tune it on our knowledge base". This is almost always wrong. If the model needs to know facts that are not in its training data, the right architecture is RAG (retrieval-augmented generation), not fine-tuning. RAG retrieves the relevant facts at query time and the model uses them. Fine-tuning a model on a knowledge base is expensive, freezes the knowledge at training time, and the model often still hallucinates because it has no grounding mechanism.

Four scenarios where fine-tuning is the right answer

Scenario 1: Structured output the model cannot reliably produce

If you need the model to emit a specific JSON schema, a specific terminology, or a specific structured format consistently — and prompting alone gets you to 92% but not 99% — fine-tuning closes the gap. Production document extractors are a common example: the foundation model can extract invoice fields 92% of the time but fine-tuning on 500 to 2,000 labelled invoices gets it to 99.5% with reliable JSON structure. We have done this for clients in finance and insurance with strong ROI in 8 to 10 weeks.

Scenario 2: Latency or cost requires a smaller model

GPT-4o or Claude Sonnet may produce excellent answers on your task, but at $3-15 per million tokens and 800ms p95 latency, the unit economics or user experience does not work. Fine-tuning a smaller open-weight model (Llama 3 8B, Mistral 7B) on outputs from the larger model — sometimes called distillation — can give you 90-95% of the quality at 1/20th the cost and 1/5th the latency. This is the second-most-common fine-tuning win we ship. Reference deployment: a healthcare voice agent that needed sub-500ms inference for HIPAA-compliant on-prem deployment — distilled Llama 3 8B fine-tuned on 8,000 examples from Claude got there.

Scenario 3: Data residency or sovereignty requires self-hosted

If the engagement requires zero customer data egress — common in healthcare, defence, regulated finance, or specific EU clients — you cannot use the major hosted LLMs at all. The model must be self-hosted, which usually means an open-weight base (Llama 3, Mistral, Qwen). Fine-tuning that base on your domain data is then often necessary because the smaller open-weight models without fine-tuning lag behind hosted models on specialised tasks.

Scenario 4: Tone, voice, or persona that prompting cannot reliably control

For consumer-facing applications where the AI's voice is part of the product — adaptive tutors, brand chatbots, character agents — prompting alone produces inconsistency across long conversations. The model drifts toward its default tone after 8-10 turns. Fine-tuning on 1,000 to 5,000 example dialogues in the desired voice locks in the persona. Mockinto (our EdTech case study, an adaptive AI interviewer) is a clean example — the fine-tuned model maintains the interviewer persona across 30-turn conversations where the base model would drift.

What fine-tuning actually costs (engineering, not just compute)

The compute bill for a LoRA fine-tune of Llama 3 8B on 5,000 examples is maybe $100-500 — almost a rounding error. The real cost is engineering and data:

Data curation: 2-6 weeks. Sampling representative examples, labelling them, validating label quality. This is the bulk of the work and the biggest source of project risk.
Eval harness: 1-2 weeks. You cannot fine-tune without a way to measure whether the fine-tune is better than the base. Build this before training, not after.
Training pipeline + experiment tracking: 1 week. Hyperparameter sweeps, checkpointing, versioned datasets and weights via MLflow or Weights & Biases.
Deployment + serving: 1-2 weeks. vLLM or TGI, throughput tuning, integration into the existing application.
Ongoing tuning: continuous. Domain drift means the fine-tune needs refresh every 3-6 months on most production deployments.

Decision tree, condensed

Need the model to know domain facts? → RAG. Do not fine-tune for knowledge.
Need a specific structured output the base model cannot reliably produce? → Fine-tune.
Need a smaller / cheaper / faster model with quality close to the large model? → Distill via fine-tuning.
Need self-hosted with zero data egress? → Open-weight + fine-tune (usually).
Need consistent tone or persona across long conversations? → Fine-tune.
If none of the above, but the base model is not good enough? → Improve prompting, retrieval, and tool use first. Most teams find the gap closes without fine-tuning.

Wrapping up

Fine-tuning is a powerful technique used for the wrong problem most of the time. When it is the right answer, it delivers production wins that nothing else can. When it is the wrong answer, it burns two months and a hundred thousand dollars to arrive at a system no better than what RAG plus careful prompting would have shipped. Run the decision tree before you commit. Build the eval harness first. And budget for the engineering time, not the compute bill — that is where the real cost lives.

TaggedLLM fine-tuningLoRA fine-tuningwhen to fine-tunefine-tuning vs RAGproduction LLM fine-tuningLlama 3 fine-tuning

When LLM Fine-Tuning Actually Pays Off

When fine-tuning is the wrong answer

Four scenarios where fine-tuning is the right answer

Scenario 1: Structured output the model cannot reliably produce

Scenario 2: Latency or cost requires a smaller model

Scenario 3: Data residency or sovereignty requires self-hosted

Scenario 4: Tone, voice, or persona that prompting cannot reliably control

What fine-tuning actually costs (engineering, not just compute)

Decision tree, condensed

Wrapping up

More articles

Hiring an AI Development Company in the USA in 2026: What to Ask, What to Verify

UK GDPR for AI Development: A Practical 2026 Guide

PIPEDA + Quebec Law 25 for AI in Canada: 2026 Compliance Checklist

Australian Privacy Act + APPs for AI Development in 2026

RAG vs Fine-Tuning in 2026: Cost, Latency, and When to Pick Which

Offshore AI Development in 2026: What Actually Works and What Doesn't

Ready to ship the system this post describes?