Fine-Tuning vs RAG: When to Use Each Approach to Customize an LLM

When a generic LLM isn't enough for your use case, because it doesn't know your domain, your style, or your specific information, you have two main paths: fine-tuning or RAG. The two are often confused, and choosing the wrong one can cost weeks of work.

This guide explains what each one solves, when each makes sense, and how to combine them when neither is sufficient on its own.

The Problem Each One Solves

To understand the difference, it helps to think about what type of "knowledge" your application needs.

RAG solves the problem of specific factual knowledge: the model doesn't know things that are in your documents, your database, or your system. The solution is to retrieve the relevant pieces and hand them to the model at query time.

Fine-tuning solves the problem of behavior and style: the model knows things but doesn't express them the way you need, doesn't follow the correct format, doesn't have the right tone, or makes systematic errors on a specific type of task.

Put differently:

  • RAG changes what the model knows on each query
  • Fine-tuning changes how the model behaves in general

What RAG Is and When to Use It

RAG (Retrieval-Augmented Generation) connects the LLM to an external knowledge base. Before generating a response, the system retrieves the most relevant fragments from that base and includes them in the model's context.
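The retrieve-then-generate flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production design: it ranks documents by word overlap (cosine similarity over bag-of-words counts) where a real system would typically use embeddings and a vector store, and the document contents and the helper names `tokenize`, `cosine`, `retrieve`, and `build_prompt` are hypothetical, invented for this example.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base. In a real system these would be document
# chunks held in a vector store, not strings in a dict.
DOCUMENTS = {
    "pricing.md": "The Pro plan costs 49 dollars per month and includes priority support.",
    "export.md": "To export a report as PDF, open Reports, select the report and click Export.",
    "login.md": "If you cannot log in, reset your password from the login screen.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, top_k=2):
    """Return the names of up to top_k documents that share words with the query."""
    query_counts = Counter(tokenize(query))
    scored = [
        (cosine(query_counts, Counter(tokenize(text))), name)
        for name, text in documents.items()
    ]
    scored.sort(reverse=True)
    return [name for score, name in scored[:top_k] if score > 0]


def build_prompt(query, documents, top_k=2):
    """Build a prompt that carries the retrieved fragments, plus the source list."""
    sources = retrieve(query, documents, top_k)
    context = "\n".join(f"[{name}] {documents[name]}" for name in sources)
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    return prompt, sources


prompt, sources = build_prompt("How do I export a report as PDF?", DOCUMENTS)
print(prompt)
print("Sources:", sources)
```

The prompt returned by `build_prompt` is what would be sent to the LLM. Returning `sources` alongside it is what makes the answer traceable back to specific documents.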

We covered the full implementation in the article RAG Explained: How to Connect Your Own Documents to an LLM. Here we focus on when it's the right choice.

RAG makes sense when:

Your information changes frequently. If your knowledge base updates regularly (pricing, inventory, regulations, product documentation), RAG is the practical option: you update the documents and the next query sees the change. Retraining a model every time a data point changes isn't viable.

You need traceability. A RAG system can return the exact passages it retrieved for a response, so users can check the answer against them. For applications where users need to verify information, such as technical support, legal advice, or medical documentation, this is critical.

The volume of information is large. A corpus of 10,000 documents won't fit in the model's context window. RAG selects only the fragments relevant to each query.

You want to get started quickly. RAG requires no training. With a document base and a vector store, you can have a working system in hours.

RAG doesn't make sense when:

  • The problem is that the model doesn't know how to do something, not that it lacks information
  • You need the model to adopt a specific style or format consistently
  • The task is so specialized that the base model makes systematic errors

What Fine-Tuning Is and When to Use It

Fine-tuning continues the training of a pretrained model on your own data to adjust its behavior. You're not teaching the model brand new information from scratch; you're adjusting its weights so it responds better to a specific type of task.

The basic process:

  1. Prepare a dataset of examples: pairs of (prompt, ideal response)
  2. Train the model with those examples for several epochs
  3. The adjusted model behaves more consistently with the pattern of your examples

Fine-tuning makes sense when:

You need a very specific and consistent style or format. If your application needs the model to always respond in JSON with an exact structure, always use a precise clinical tone, or always follow a particular reasoning flow — fine-tuning is more efficient than including long instructions in every prompt.

The base model makes systematic errors in your domain. If you work in a highly specialized domain — tax law specific to a country, medical terminology of a specific specialty, a proprietary programming language — and the base model fails predictably, fine-tuning can correct those patterns.

You want to reduce the cost of long prompts. If your system prompt has 2,000 tokens of instructions that repeat in every call, fine-tuning can internalize much of that behavior so the prompt can be cut down substantially. At scale the token savings are significant, though they have to be weighed against the higher per-token price of a fine-tuned model.

You have enough quality data. Fine-tuning on too few examples, or on low-quality ones, can produce results worse than the base model. As a practical target, plan for a few hundred well-curated examples.
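The prompt-cost argument above (a 2,000-token system prompt repeated on every call) can be made concrete with a little arithmetic. The sketch below is only illustrative: the call volume, the user-message size, the shortened prompt size, and both prices are hypothetical placeholders rather than real pricing, and output tokens and the one-off training cost are left out.

```python
def monthly_input_cost(tokens_per_call, calls_per_month, price_per_million_tokens):
    """Input-token cost per month, in the same currency as the price."""
    return tokens_per_call * calls_per_month * price_per_million_tokens / 1_000_000


# All figures below are hypothetical placeholders, except the 2,000-token
# system prompt, which comes from the scenario in the text.
LONG_SYSTEM_PROMPT_TOKENS = 2_000
SHORT_SYSTEM_PROMPT_TOKENS = 50    # assumed prompt size after fine-tuning
USER_MESSAGE_TOKENS = 150          # assumed average user message
CALLS_PER_MONTH = 1_000_000        # assumed traffic
BASE_PRICE = 0.15                  # assumed price per 1M input tokens
FINE_TUNED_PRICE = 0.30            # assumed: fine-tuned models cost more per token

with_long_prompt = monthly_input_cost(
    LONG_SYSTEM_PROMPT_TOKENS + USER_MESSAGE_TOKENS, CALLS_PER_MONTH, BASE_PRICE
)
with_fine_tuning = monthly_input_cost(
    SHORT_SYSTEM_PROMPT_TOKENS + USER_MESSAGE_TOKENS, CALLS_PER_MONTH, FINE_TUNED_PRICE
)

print(f"Base model, long prompt:  {with_long_prompt:.2f} per month")
print(f"Fine-tuned, short prompt: {with_fine_tuning:.2f} per month")
print(f"Difference:               {with_long_prompt - with_fine_tuning:.2f} per month")
```

With these made-up numbers the fine-tuned setup is cheaper even at double the per-token price, because the prompt shrinks by far more than the price grows. At low traffic the difference may not cover the training cost, so it is worth rerunning the calculation with real prices and volumes.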

Fine-tuning doesn't make sense when:

  • The information you need to incorporate changes frequently
  • You don't have enough labeled quality data
  • The problem is lack of information, not behavior
  • You're in prototype phase — the cost and time of fine-tuning aren't justified until you validate the use case

How to Fine-Tune with the OpenAI API

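OpenAI offers fine-tuning on several of its models, including GPT-4o-mini and GPT-3.5-turbo. The list of supported models changes over time, so check the official fine-tuning documentation, which also covers the complete process. Here's the essential flow: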

Prepare the Dataset

The required format is JSONL (JSON Lines) — one JSON object per line:

{"messages": [{"role": "system", "content": "You are a technical support assistant for accounting software."}, {"role": "user", "content": "How do I export a report as PDF?"}, {"role": "assistant", "content": "To export as PDF: go to Reports → select the report → click Export → choose PDF. The file downloads automatically."}]}
{"messages": [{"role": "system", "content": "You are a technical support assistant for accounting software."}, {"role": "user", "content": "I can't log in."}, {"role": "assistant", "content": "Verify the email matches your account. If you forgot your password, use the 'Forgot password' link on the login screen. If the problem persists, contact support at support@yourapp.com."}]}

Each line is a complete conversation example. The API requires at least 10 examples, and OpenAI's guidance is that clear improvements typically appear at around 50-100. In practice, 500 or more is preferable for robust results.
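Before uploading, it can help to check the file locally, since a malformed line is easier to fix before a job is created. The following is a minimal sketch that covers only the plain chat format shown above (system, user, and assistant messages with text content). The function names `validate_example` and `validate_dataset` are hypothetical, and the check is stricter than the real format, which also allows tool-related messages that this sketch would reject.

```python
import json

# Roles used by the plain chat format shown in this article.
VALID_ROLES = {"system", "user", "assistant"}


def validate_example(line):
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        example = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    if not isinstance(example, dict):
        return ["line is not a JSON object"]

    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    roles = []
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {index} is not an object")
            continue
        role = message.get("role")
        roles.append(role)
        if role not in VALID_ROLES:
            problems.append(f"message {index} has unexpected role {role!r}")
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {index} has empty or non-text content")

    # Without an assistant message there is no target response to learn from.
    if "assistant" not in roles:
        problems.append("no assistant message")
    return problems


def validate_dataset(lines):
    """Map 1-based line numbers to their problems, skipping blank lines."""
    report = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        problems = validate_example(line)
        if problems:
            report[number] = problems
    return report
```

A typical use would be `validate_dataset(open("training_dataset.jsonl", encoding="utf-8"))`, uploading only when the returned report is empty.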

Upload the Dataset and Launch Fine-Tuning

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# Upload the training file
with open("training_dataset.jsonl", "rb") as f:
    file = client.files.create(
        file=f,
        purpose="fine-tune"
    )

print(f"File uploaded: {file.id}")

# Launch the fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18"
)

print(f"Fine-tuning job: {job.id}")
print(f"Status: {job.status}")

Monitor and Use the Model

# Check job status
job_status = client.fine_tuning.jobs.retrieve(job.id)
print(f"Status: {job_status.status}")
print(f"Model: {job_status.fine_tuned_model}")

# Once complete, use the fine-tuned model
if job_status.fine_tuned_model:
    response = client.chat.completions.create(
        model=job_status.fine_tuned_model,
        messages=[
            {"role": "system", "content": "You are a technical support assistant for accounting software."},
            {"role": "user", "content": "How do I generate the quarterly report?"}
        ]
    )
    print(response.choices[0].message.content)

Training time depends on dataset size, the base model, and how busy the queue is: expect roughly 15 minutes for small datasets and several hours for large ones. Fine-tuning is billed per token processed during training, and calls to the resulting model are billed at a higher per-token price than the base model. Current pricing is at openai.com/api/pricing.
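Because a job can run for minutes or hours, a single status check is rarely enough. One way to wait for completion is a small polling helper like the sketch below. The helper name `wait_for_job` and its parameters are hypothetical. It takes the retrieve function as an argument so it can be exercised offline with a fake, and the status strings are the ones the fine-tuning API reports as far as I know, so verify them against the current documentation.

```python
import time
from types import SimpleNamespace

# Statuses after which the job will not change any more (assumed set).
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(retrieve, job_id, poll_seconds=60, timeout_seconds=6 * 60 * 60,
                 sleep=time.sleep):
    """Poll retrieve(job_id) until the job reaches a terminal status.

    `retrieve` is any callable returning an object with a `.status` attribute,
    for example `client.fine_tuning.jobs.retrieve`.
    """
    waited = 0
    while True:
        job = retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        if waited >= timeout_seconds:
            raise TimeoutError(f"job {job_id} still '{job.status}' after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


def make_fake_retrieve(statuses):
    """Build a stand-in for the API call that walks through the given statuses."""
    remaining = iter(statuses)

    def retrieve(job_id):
        status = next(remaining)
        model = "ft:example-model" if status == "succeeded" else None
        return SimpleNamespace(id=job_id, status=status, fine_tuned_model=model)

    return retrieve


# Offline demonstration: no network call, no real waiting.
fake = make_fake_retrieve(["validating_files", "running", "succeeded"])
finished = wait_for_job(fake, "ftjob-example", poll_seconds=1, sleep=lambda s: None)
print(finished.status, finished.fine_tuned_model)
```

Against the real API the call would be `wait_for_job(client.fine_tuning.jobs.retrieve, job.id)`. Check `status` on the returned job before using `fine_tuned_model`, since a failed or cancelled job has no model.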

Fine-Tuning with Open Source Models

If data privacy is a constraint or you want to avoid OpenAI's fine-tuning cost, you can fine-tune open source models like Llama 3.1 or Mistral.

The most widely used tools:

Unsloth: optimizes fine-tuning for speed and memory. The project reports training roughly 2-5x faster with less VRAM, though the gain depends on model and hardware. Compatible with Llama, Mistral, and Qwen. Has notebook-format tutorials to get started quickly.

Axolotl: more complete framework for fine-tuning with support for multiple techniques (LoRA, QLoRA, full fine-tuning). More initial configuration but more flexible.

LLaMA Factory: web interface and CLI for fine-tuning many open source models without writing code.

The most widely used technique for fine-tuning with limited hardware is LoRA (Low-Rank Adaptation). It freezes the original weights and trains small low-rank matrices added alongside them, so only a small fraction of the model's parameters is updated and VRAM requirements drop dramatically. The original paper is at arxiv.org/abs/2106.09685.
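To see why LoRA trains so few parameters, consider a single weight matrix W of shape d_out × d_in. LoRA keeps W frozen and learns two small matrices, B (d_out × r) and A (r × d_in), whose product is added to W, so the trainable count is r × (d_in + d_out) instead of d_in × d_out. The sketch below just does that arithmetic. The 4096 × 4096 layer size and the rank of 8 are illustrative assumptions, not values from any particular model.

```python
def full_params(d_in, d_out):
    """Parameters updated when fine-tuning the whole weight matrix."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Parameters updated by LoRA: B is d_out x rank, A is rank x d_in."""
    return rank * (d_in + d_out)


# Illustrative sizes (assumptions): one square projection matrix, rank 8.
D_IN, D_OUT, RANK = 4096, 4096, 8

full = full_params(D_IN, D_OUT)
lora = lora_params(D_IN, D_OUT, RANK)

print(f"Full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank {RANK}):    {lora:,} trainable parameters")
print(f"Fraction trained: {lora / full:.2%}")
```

For this layer, LoRA trains 65,536 parameters instead of 16,777,216, about 0.4% of the matrix. Optimizer state and gradients shrink in the same proportion, which is where most of the VRAM saving comes from.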

Combining RAG and Fine-Tuning

RAG and fine-tuning are not mutually exclusive. In mature production systems, it's common to use both:

  • Fine-tuning so the model adopts the correct tone, format, and behavior
  • RAG so the model accesses updated and specific information

Practical example: a support assistant for a software company might have:

  • A fine-tuned model to always respond in the company's tone, follow the standard response format, and correctly handle the most frequent query types
  • RAG connected to the product documentation, which updates with each release

Fine-tuning makes the model consistent. RAG keeps the model current.
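In code, the combination comes down to sending retrieved context to the fine-tuned model instead of the base one. The sketch below only builds the request. `build_messages`, the chunk structure, and the model identifier are hypothetical placeholders, and the retrieval step is assumed to have already produced the chunks.

```python
# Placeholder identifier: use the `fine_tuned_model` value returned by your own job.
FINE_TUNED_MODEL = "ft:your-fine-tuned-model-id"

SYSTEM_PROMPT = "You are a technical support assistant for accounting software."


def build_messages(question, retrieved_chunks, system_prompt=SYSTEM_PROMPT):
    """Combine retrieved documentation with the user's question.

    `retrieved_chunks` is a list of dicts with 'source' and 'text' keys,
    as produced by whatever retrieval step the RAG system uses.
    """
    context = "\n\n".join(
        f"[{chunk['source']}] {chunk['text']}" for chunk in retrieved_chunks
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]


# Hypothetical retrieval result for the example question.
chunks = [
    {"source": "release-notes.md", "text": "Quarterly reports are under Reports → Periodic."},
]
messages = build_messages("How do I generate the quarterly report?", chunks)

# With a real client and a finished job, the call would look like:
# response = client.chat.completions.create(model=FINE_TUNED_MODEL, messages=messages)
print(messages[1]["content"])
```

The system prompt can stay short because tone and format come from fine-tuning, while everything that changes between releases arrives through the retrieved context.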

Quick Decision Guide

If you're not sure which approach to use, this tree covers most cases:

Is the problem that the model lacks specific information?
  • Yes → RAG
  • No → continue

Does the information change frequently?
  • Yes → RAG (fine-tuning isn't viable for dynamic information)
  • No → continue

Does the base model make systematic errors in format, tone, or task type?
  • Yes → Fine-tuning
  • No → continue

Do you have more than 200 high-quality examples?
  • No → improve the prompt before fine-tuning
  • Yes → fine-tuning may make sense

Are you in prototype phase?
  • Yes → start with RAG or prompt engineering, fine-tune after validating
  • No → evaluate the cost/benefit of fine-tuning for production
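The same questions can be written as a small function, which is handy for applying the checklist across several candidate projects. This is one literal reading of the tree, with the questions asked in order and the first decisive answer returned. Where the tree is ambiguous (a "yes" on the example-count question), the sketch moves on to the prototype question, and that choice is an assumption. The function and parameter names are hypothetical.

```python
def recommend(lacks_information, changes_frequently, systematic_errors,
              quality_examples, prototype_phase):
    """Walk the decision questions in order and return the first decisive answer."""
    if lacks_information:
        return "RAG"
    if changes_frequently:
        return "RAG"
    if systematic_errors:
        return "fine-tuning"
    if quality_examples <= 200:
        return "improve the prompt before fine-tuning"
    # Assumption: enough examples means "keep going", not an immediate verdict.
    if prototype_phase:
        return "start with RAG or prompt engineering, fine-tune after validating"
    return "evaluate the cost/benefit of fine-tuning for production"


# Example: a support bot whose product documentation changes with each release.
print(recommend(
    lacks_information=False,
    changes_frequently=True,
    systematic_errors=False,
    quality_examples=0,
    prototype_phase=True,
))
```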

The general rule: exhaust prompt engineering possibilities before fine-tuning, and consider RAG before fine-tuning if the problem is knowledge-based. Fine-tuning is a powerful tool, just not the first one you should reach for.