How to Export ChatGPT API Fine-Tuned Model for Local

Last updated: March 16, 2026

Export fine-tuned ChatGPT models by calling the OpenAI API with your fine-tuned model ID—you cannot directly download the weights but can run inference locally with proper API integration. This guide explains the limitations and the practical workflow for local deployment.

Prerequisites
Cost Comparison: OpenAI API vs Self-Hosted
Troubleshooting
Related Reading

Prerequisites

Before you begin, make sure you have the following ready:

A computer running macOS, Linux, or Windows
Terminal or command-line access
Administrator or sudo privileges (for system-level changes)
A stable internet connection for downloading tools

Step 1: Understand What You Actually Have

Before attempting an export, you need to understand the technical reality: OpenAI does not provide a direct download mechanism for fine-tuned model weights. Their fine-tuning service creates a model that lives on their infrastructure, and you access it exclusively through their API. This is fundamentally different from open-source fine-tuning where you have full access to the model files.

However, several legitimate approaches exist to achieve local fine-tuned inference:

Recreate the fine-tuning using open-source models — Take your training data and fine-tune an open-source model like Llama 2, Mistral, or Phi on it
Use distilled models — Apply knowledge distillation techniques to create a smaller local model
Convert OpenAI format models — If you have access to compatible model formats through other means

The most practical path for most developers involves recreating the fine-tuning on an open-source base model.

Step 2: What OpenAI’s Fine-Tuning API Actually Gives You

OpenAI’s fine-tuning API currently supports base models including gpt-4o-mini, gpt-3.5-turbo, and several others. When you submit a fine-tuning job, OpenAI:

Takes your JSONL training file containing prompt-completion pairs
Runs supervised fine-tuning (SFT) on their proprietary model infrastructure
Creates a model variant with a unique ID like ft:gpt-3.5-turbo:your-org:custom-name:abc123
Gives you API access to that model at per-token pricing higher than the base model

You own the training data and the fine-tuning job metadata. You do not own the resulting weights. This distinction matters enormously for portability and cost planning.

Step 3: Exporting Your Training Data

The first step is extracting the training data you used for fine-tuning. If you still have your training files, you’re in luck. If not, you can retrieve your fine-tuning job details and training examples through the OpenAI API.

import openai
import json

# Initialize with your API key
openai.api_key = "your-api-key"

# List your fine-tuning jobs
fine_tune_jobs = openai.FineTuningJob.list(limit=10)

# Get the job ID for your specific fine-tuned model
job_id = "ft-your-job-id"

# Retrieve job details
job = openai.FineTuningJob.retrieve(job_id)
print(f"Training file ID: {job.training_file}")
print(f"Base model: {job.model}")
print(f"Status: {job.status}")

Once you have your training file ID, download the training data:

# Download the training file
training_file_id = job.training_file
file_response = openai.files.content(training_file_id)
training_data = file_response.text

# Parse and save locally
with open("training_data.jsonl", "w") as f:
    f.write(training_data)

print("Training data saved to training_data.jsonl")

This training data becomes your seed for recreating the model locally.

Step 4: Converting OpenAI JSONL Format for Open-Source Training

OpenAI’s fine-tuning format uses a chat messages structure. Most open-source training frameworks expect a slightly different format, so you will need a conversion step:

import json

def convert_openai_to_alpaca(input_path: str, output_path: str) -> None:
    """Convert OpenAI chat fine-tuning JSONL to Alpaca instruction format."""
    converted = []

    with open(input_path) as f:
        for line in f:
            example = json.loads(line.strip())
            messages = example.get("messages", [])

            system_msg = next((m["content"] for m in messages if m["role"] == "system"), "")
            user_msg = next((m["content"] for m in messages if m["role"] == "user"), "")
            assistant_msg = next((m["content"] for m in messages if m["role"] == "assistant"), "")

            converted.append({
                "instruction": user_msg,
                "input": "",
                "output": assistant_msg,
                "system": system_msg
            })

    with open(output_path, "w") as f:
        for item in converted:
            f.write(json.dumps(item) + "\n")

    print(f"Converted {len(converted)} examples to {output_path}")

convert_openai_to_alpaca("training_data.jsonl", "training_alpaca.jsonl")

Step 5: Recreating the Fine-Tuned Model Locally

With your training data extracted, you can now fine-tune an open-source model. The goal is to replicate the behavior of your OpenAI fine-tuned model as closely as possible using local infrastructure.

Choosing Your Base Model

Select a base model that matches the capabilities of your fine-tuned model:

Llama 2 7B — Good balance of capability and hardware requirements
Mistral 7B — Strong performance, efficient inference
Phi-2 — Smaller, faster, suitable for less complex tasks

For most use cases, Mistral 7B provides excellent results with reasonable hardware requirements. A single A100 GPU or even a high-end consumer GPU can handle inference.

Hardware Requirements by Model Size

Model	Minimum VRAM (FP16)	Minimum VRAM (4-bit)	Training VRAM (LoRA)
Phi-2 (2.7B)	6 GB	3 GB	10 GB
Mistral 7B	16 GB	6 GB	20 GB
Llama 2 13B	28 GB	10 GB	40 GB
Llama 2 70B	140 GB	40 GB	80 GB (multi-GPU)

Fine-Tuning Locally with LoRA

Low-Rank Adaptation (LoRA) lets you fine-tune efficiently without retraining the entire model:

# Install required packages
pip install transformers peft accelerate bitsandbytes

# Run fine-tuning with LoRA
python train_lora.py \
    --model mistralai/Mistral-7B-v0.1 \
    --data_path ./training_data.jsonl \
    --output_dir ./local-finetuned-model \
    --num_epochs 3 \
    --batch_size 4 \
    --learning_rate 2e-4 \
    --lora_r 16 \
    --lora_alpha 32

Here is a basic training script:

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

# Load base model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    load_in_8bit=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA to model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Load your training data
dataset = load_dataset("json", data_files="training_data.jsonl")

# Training arguments
training_args = TrainingArguments(
    output_dir="./local-finetuned-model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-4,
    logging_steps=10,
    save_strategy="epoch",
)

# Train (simplified—you would use a proper trainer in production)
# This demonstrates the concept
print("Ready to fine-tune locally")

Step 6: Run Inference Locally

After fine-tuning completes, you can run inference without any external API calls:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    load_in_8bit=True,
    device_map="auto"
)

# Load fine-tuned weights
model = PeftModel.from_pretrained(
    base_model,
    "./local-finetuned-model",
    adapter_name="fine-tuned"
)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Run inference
def generate(prompt, max_tokens=256):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        temperature=0.7,
        top_p=0.9
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test your local model
response = generate("Your prompt here")
print(response)

Step 7: Serving the Model as a Local API

Once you have a working local model, you often want to expose it with an OpenAI-compatible API so that existing code that calls openai.chat.completions.create works without modification. The vllm library makes this straightforward:

pip install vllm

# Serve the merged model with an OpenAI-compatible endpoint
python -m vllm.entrypoints.openai.api_server \
    --model ./local-finetuned-model-merged \
    --host 0.0.0.0 \
    --port 8000

Your application can then point to http://localhost:8000 instead of the OpenAI API, with zero code changes beyond the base URL.

Step 8: Optimizing for Production

For production deployment, consider these optimizations:

Quantization — Use 4-bit or 8-bit quantization to reduce memory requirements
vLLM — Deploy with vLLM for faster inference throughput
TensorRT-LLM — For maximum performance on NVIDIA hardware

# Example: Loading with 4-bit quantization using bitsandbytes
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=quantization_config,
    device_map="auto"
)

Step 9: Limitations and Tradeoffs

You should know that recreating your fine-tuned model locally will not produce identical results. Differences include:

Base model variations between OpenAI’s and open-source models
Training dynamics and hyperparameters differ from OpenAI’s internal process
Dataset formatting and preprocessing may vary
OpenAI applies RLHF and safety tuning on top of SFT, which you will not replicate

Test thoroughly to ensure the local model meets your quality requirements before migrating production workloads. A practical evaluation strategy: run 50-100 representative prompts through both the OpenAI fine-tuned model and your local reproduction, then score the outputs on your specific quality criteria. Expect 80-90% behavioral parity as a realistic target; achieving exact parity is not possible without access to OpenAI’s internal training pipeline.

Cost Comparison: OpenAI API vs Self-Hosted

One of the primary motivations for local deployment is cost reduction at scale. Here is how the economics typically compare:

Scenario	OpenAI Fine-Tuned (gpt-3.5-turbo)	Self-Hosted Mistral 7B
Monthly inference volume	10M tokens	10M tokens
API cost (input + output)	~$120/month	$0 (after hardware)
Hardware cost (A10G cloud)	—	~$300/month
Break-even volume	—	~25M tokens/month
Latency (P50)	200-400ms	50-150ms (local GPU)
Uptime guarantee	99.9% SLA	Self-managed

At volumes above roughly 25M tokens per month, self-hosting becomes economically favorable. Below that threshold, the simplicity and reliability of OpenAI’s API usually wins. The latency advantage of local inference is significant for user-facing applications regardless of cost.

Step 10: Common Pitfalls to Avoid

Forgetting to merge LoRA adapters before serving. LoRA adapters are stored separately from base model weights. For production serving with vLLM, merge them first:

from peft import PeftModel
import torch

base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base_model, "./local-finetuned-model")
merged = model.merge_and_unload()
merged.save_pretrained("./local-finetuned-model-merged")
tokenizer.save_pretrained("./local-finetuned-model-merged")

Using mismatched tokenizers. Always load the tokenizer from the same base model checkpoint. Mismatched tokenizers produce garbled outputs that are hard to diagnose.

Skipping chat template formatting. Mistral and Llama models require specific chat templates to produce coherent conversational outputs. Pass messages through the tokenizer’s apply_chat_template method rather than formatting prompts manually.

Troubleshooting

Configuration changes not taking effect

Restart the relevant service or application after making changes. Some settings require a full system reboot. Verify the configuration file path is correct and the syntax is valid.

Permission denied errors

Run the command with sudo for system-level operations, or check that your user account has the necessary permissions. On macOS, you may need to grant terminal access in System Settings > Privacy & Security.

Connection or network-related failures

Check your internet connection and firewall settings. If using a VPN, try disconnecting temporarily to isolate the issue. Verify that the target server or service is accessible from your network.

Frequently Asked Questions

How long does it take to export chatgpt api fine-tuned model for local?

For a straightforward setup, expect 30 minutes to 2 hours depending on your familiarity with the tools involved. Complex configurations with custom requirements may take longer. Having your credentials and environment ready before starting saves significant time.

What are the most common mistakes to avoid?

The most frequent issues are skipping prerequisite steps, using outdated package versions, and not reading error messages carefully. Follow the steps in order, verify each one works before moving on, and check the official documentation if something behaves unexpectedly.

Do I need prior experience to follow this guide?

Basic familiarity with the relevant tools and command line is helpful but not strictly required. Each step is explained with context. If you get stuck, the official documentation for each tool covers fundamentals that may fill in knowledge gaps.

Will this work with my existing CI/CD pipeline?

The core concepts apply across most CI/CD platforms, though specific syntax and configuration differ. You may need to adapt file paths, environment variable names, and trigger conditions to match your pipeline tool. The underlying workflow logic stays the same.

Where can I get help if I run into issues?

Start with the official documentation for each tool mentioned. Stack Overflow and GitHub Issues are good next steps for specific error messages. Community forums and Discord servers for the relevant tools often have active members who can help with setup problems.

Built by theluckystrike — More at zovo.one

Table of Contents