Last updated: March 16, 2026

Export fine-tuned ChatGPT models by calling the OpenAI API with your fine-tuned model ID—you cannot directly download the weights but can run inference locally with proper API integration. This guide explains the limitations and the practical workflow for local deployment.

Table of Contents

Prerequisites

Before you begin, make sure you have the following ready:

Step 1: Understand What You Actually Have

Before attempting an export, you need to understand the technical reality: OpenAI does not provide a direct download mechanism for fine-tuned model weights. Their fine-tuning service creates a model that lives on their infrastructure, and you access it exclusively through their API. This is fundamentally different from open-source fine-tuning where you have full access to the model files.

However, several legitimate approaches exist to achieve local fine-tuned inference:

  1. Recreate the fine-tuning using open-source models — Take your training data and fine-tune an open-source model like Llama 2, Mistral, or Phi on it

  2. Use distilled models — Apply knowledge distillation techniques to create a smaller local model

  3. Convert OpenAI format models — If you have access to compatible model formats through other means

The most practical path for most developers involves recreating the fine-tuning on an open-source base model.

Step 2: What OpenAI’s Fine-Tuning API Actually Gives You

OpenAI’s fine-tuning API currently supports base models including gpt-4o-mini, gpt-3.5-turbo, and several others. When you submit a fine-tuning job, OpenAI:

  1. Takes your JSONL training file containing prompt-completion pairs
  2. Runs supervised fine-tuning (SFT) on their proprietary model infrastructure
  3. Creates a model variant with a unique ID like ft:gpt-3.5-turbo:your-org:custom-name:abc123
  4. Gives you API access to that model at per-token pricing higher than the base model

You own the training data and the fine-tuning job metadata. You do not own the resulting weights. This distinction matters enormously for portability and cost planning.

Step 3: Exporting Your Training Data

The first step is extracting the training data you used for fine-tuning. If you still have your training files, you’re in luck. If not, you can retrieve your fine-tuning job details and training examples through the OpenAI API.

import openai
import json

# Initialize with your API key
openai.api_key = "your-api-key"

# List your fine-tuning jobs
fine_tune_jobs = openai.FineTuningJob.list(limit=10)

# Get the job ID for your specific fine-tuned model
job_id = "ft-your-job-id"

# Retrieve job details
job = openai.FineTuningJob.retrieve(job_id)
print(f"Training file ID: {job.training_file}")
print(f"Base model: {job.model}")
print(f"Status: {job.status}")

Once you have your training file ID, download the training data:

# Download the training file
training_file_id = job.training_file
file_response = openai.files.content(training_file_id)
training_data = file_response.text

# Parse and save locally
with open("training_data.jsonl", "w") as f:
    f.write(training_data)

print("Training data saved to training_data.jsonl")

This training data becomes your seed for recreating the model locally.

Step 4: Converting OpenAI JSONL Format for Open-Source Training

OpenAI’s fine-tuning format uses a chat messages structure. Most open-source training frameworks expect a slightly different format, so you will need a conversion step:

import json

def convert_openai_to_alpaca(input_path: str, output_path: str) -> None:
    """Convert OpenAI chat fine-tuning JSONL to Alpaca instruction format."""
    converted = []

    with open(input_path) as f:
        for line in f:
            example = json.loads(line.strip())
            messages = example.get("messages", [])

            system_msg = next((m["content"] for m in messages if m["role"] == "system"), "")
            user_msg = next((m["content"] for m in messages if m["role"] == "user"), "")
            assistant_msg = next((m["content"] for m in messages if m["role"] == "assistant"), "")

            converted.append({
                "instruction": user_msg,
                "input": "",
                "output": assistant_msg,
                "system": system_msg
            })

    with open(output_path, "w") as f:
        for item in converted:
            f.write(json.dumps(item) + "\n")

    print(f"Converted {len(converted)} examples to {output_path}")

convert_openai_to_alpaca("training_data.jsonl", "training_alpaca.jsonl")

Step 5: Recreating the Fine-Tuned Model Locally

With your training data extracted, you can now fine-tune an open-source model. The goal is to replicate the behavior of your OpenAI fine-tuned model as closely as possible using local infrastructure.

Choosing Your Base Model

Select a base model that matches the capabilities of your fine-tuned model:

For most use cases, Mistral 7B provides excellent results with reasonable hardware requirements. A single A100 GPU or even a high-end consumer GPU can handle inference.

Hardware Requirements by Model Size

Model Minimum VRAM (FP16) Minimum VRAM (4-bit) Training VRAM (LoRA)
Phi-2 (2.7B) 6 GB 3 GB 10 GB
Mistral 7B 16 GB 6 GB 20 GB
Llama 2 13B 28 GB 10 GB 40 GB
Llama 2 70B 140 GB 40 GB 80 GB (multi-GPU)

Fine-Tuning Locally with LoRA

Low-Rank Adaptation (LoRA) lets you fine-tune efficiently without retraining the entire model:

# Install required packages
pip install transformers peft accelerate bitsandbytes

# Run fine-tuning with LoRA
python train_lora.py \
    --model mistralai/Mistral-7B-v0.1 \
    --data_path ./training_data.jsonl \
    --output_dir ./local-finetuned-model \
    --num_epochs 3 \
    --batch_size 4 \
    --learning_rate 2e-4 \
    --lora_r 16 \
    --lora_alpha 32

Here is a basic training script:

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

# Load base model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    load_in_8bit=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA to model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Load your training data
dataset = load_dataset("json", data_files="training_data.jsonl")

# Training arguments
training_args = TrainingArguments(
    output_dir="./local-finetuned-model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="epoch",
)

# Train (simplified—you would use a proper trainer in production)
# This demonstrates the concept
print("Ready to fine-tune locally")

Step 6: Run Inference Locally

After fine-tuning completes, you can run inference without any external API calls:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    load_in_8bit=True,
    device_map="auto"
)

# Load fine-tuned weights
model = PeftModel.from_pretrained(
    base_model,
    "./local-finetuned-model",
    adapter_name="fine-tuned"
)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Run inference
def generate(prompt, max_tokens=256):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        temperature=0.7,
        top_p=0.9
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test your local model
response = generate("Your prompt here")
print(response)

Step 7: Serving the Model as a Local API

Once you have a working local model, you often want to expose it with an OpenAI-compatible API so that existing code that calls openai.chat.completions.create works without modification. The vllm library makes this straightforward:

pip install vllm

# Serve the merged model with an OpenAI-compatible endpoint
python -m vllm.entrypoints.openai.api_server \
    --model ./local-finetuned-model-merged \
    --host 0.0.0.0 \
    --port 8000

Your application can then point to http://localhost:8000 instead of the OpenAI API, with zero code changes beyond the base URL.

Step 8: Optimizing for Production

For production deployment, consider these optimizations:

# Example: Loading with 4-bit quantization using bitsandbytes
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=quantization_config,
    device_map="auto"
)

Step 9: Limitations and Tradeoffs

You should know that recreating your fine-tuned model locally will not produce identical results. Differences include:

Test thoroughly to ensure the local model meets your quality requirements before migrating production workloads. A practical evaluation strategy: run 50-100 representative prompts through both the OpenAI fine-tuned model and your local reproduction, then score the outputs on your specific quality criteria. Expect 80-90% behavioral parity as a realistic target; achieving exact parity is not possible without access to OpenAI’s internal training pipeline.

Cost Comparison: OpenAI API vs Self-Hosted

One of the primary motivations for local deployment is cost reduction at scale. Here is how the economics typically compare:

Scenario OpenAI Fine-Tuned (gpt-3.5-turbo) Self-Hosted Mistral 7B
Monthly inference volume 10M tokens 10M tokens
API cost (input + output) ~$120/month $0 (after hardware)
Hardware cost (A10G cloud) ~$300/month
Break-even volume ~25M tokens/month
Latency (P50) 200-400ms 50-150ms (local GPU)
Uptime guarantee 99.9% SLA Self-managed

At volumes above roughly 25M tokens per month, self-hosting becomes economically favorable. Below that threshold, the simplicity and reliability of OpenAI’s API usually wins. The latency advantage of local inference is significant for user-facing applications regardless of cost.

Step 10: Common Pitfalls to Avoid

Forgetting to merge LoRA adapters before serving. LoRA adapters are stored separately from base model weights. For production serving with vLLM, merge them first:

from peft import PeftModel
import torch

base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base_model, "./local-finetuned-model")
merged = model.merge_and_unload()
merged.save_pretrained("./local-finetuned-model-merged")
tokenizer.save_pretrained("./local-finetuned-model-merged")

Using mismatched tokenizers. Always load the tokenizer from the same base model checkpoint. Mismatched tokenizers produce garbled outputs that are hard to diagnose.

Skipping chat template formatting. Mistral and Llama models require specific chat templates to produce coherent conversational outputs. Pass messages through the tokenizer’s apply_chat_template method rather than formatting prompts manually.

Troubleshooting

Configuration changes not taking effect

Restart the relevant service or application after making changes. Some settings require a full system reboot. Verify the configuration file path is correct and the syntax is valid.

Permission denied errors

Run the command with sudo for system-level operations, or check that your user account has the necessary permissions. On macOS, you may need to grant terminal access in System Settings > Privacy & Security.

Connection or network-related failures

Check your internet connection and firewall settings. If using a VPN, try disconnecting temporarily to isolate the issue. Verify that the target server or service is accessible from your network.

Frequently Asked Questions

How long does it take to export chatgpt api fine-tuned model for local?

For a straightforward setup, expect 30 minutes to 2 hours depending on your familiarity with the tools involved. Complex configurations with custom requirements may take longer. Having your credentials and environment ready before starting saves significant time.

What are the most common mistakes to avoid?

The most frequent issues are skipping prerequisite steps, using outdated package versions, and not reading error messages carefully. Follow the steps in order, verify each one works before moving on, and check the official documentation if something behaves unexpectedly.

Do I need prior experience to follow this guide?

Basic familiarity with the relevant tools and command line is helpful but not strictly required. Each step is explained with context. If you get stuck, the official documentation for each tool covers fundamentals that may fill in knowledge gaps.

Will this work with my existing CI/CD pipeline?

The core concepts apply across most CI/CD platforms, though specific syntax and configuration differ. You may need to adapt file paths, environment variable names, and trigger conditions to match your pipeline tool. The underlying workflow logic stays the same.

Where can I get help if I run into issues?

Start with the official documentation for each tool mentioned. Stack Overflow and GitHub Issues are good next steps for specific error messages. Community forums and Discord servers for the relevant tools often have active members who can help with setup problems.

Built by theluckystrike — More at zovo.one