Last updated: March 16, 2026
Export fine-tuned ChatGPT models by calling the OpenAI API with your fine-tuned model ID—you cannot directly download the weights but can run inference locally with proper API integration. This guide explains the limitations and the practical workflow for local deployment.
Table of Contents
Prerequisites
Before you begin, make sure you have the following ready:
- A computer running macOS, Linux, or Windows
- Terminal or command-line access
- Administrator or sudo privileges (for system-level changes)
- A stable internet connection for downloading tools
Step 1: Understand What You Actually Have
Before attempting an export, you need to understand the technical reality: OpenAI does not provide a direct download mechanism for fine-tuned model weights. Their fine-tuning service creates a model that lives on their infrastructure, and you access it exclusively through their API. This is fundamentally different from open-source fine-tuning where you have full access to the model files.
However, several legitimate approaches exist to achieve local fine-tuned inference:
-
Recreate the fine-tuning using open-source models — Take your training data and fine-tune an open-source model like Llama 2, Mistral, or Phi on it
-
Use distilled models — Apply knowledge distillation techniques to create a smaller local model
-
Convert OpenAI format models — If you have access to compatible model formats through other means
The most practical path for most developers involves recreating the fine-tuning on an open-source base model.
Step 2: What OpenAI’s Fine-Tuning API Actually Gives You
OpenAI’s fine-tuning API currently supports base models including gpt-4o-mini, gpt-3.5-turbo, and several others. When you submit a fine-tuning job, OpenAI:
- Takes your JSONL training file containing prompt-completion pairs
- Runs supervised fine-tuning (SFT) on their proprietary model infrastructure
- Creates a model variant with a unique ID like
ft:gpt-3.5-turbo:your-org:custom-name:abc123 - Gives you API access to that model at per-token pricing higher than the base model
You own the training data and the fine-tuning job metadata. You do not own the resulting weights. This distinction matters enormously for portability and cost planning.
Step 3: Exporting Your Training Data
The first step is extracting the training data you used for fine-tuning. If you still have your training files, you’re in luck. If not, you can retrieve your fine-tuning job details and training examples through the OpenAI API.
import openai
import json
# Initialize with your API key
openai.api_key = "your-api-key"
# List your fine-tuning jobs
fine_tune_jobs = openai.FineTuningJob.list(limit=10)
# Get the job ID for your specific fine-tuned model
job_id = "ft-your-job-id"
# Retrieve job details
job = openai.FineTuningJob.retrieve(job_id)
print(f"Training file ID: {job.training_file}")
print(f"Base model: {job.model}")
print(f"Status: {job.status}")
Once you have your training file ID, download the training data:
# Download the training file
training_file_id = job.training_file
file_response = openai.files.content(training_file_id)
training_data = file_response.text
# Parse and save locally
with open("training_data.jsonl", "w") as f:
f.write(training_data)
print("Training data saved to training_data.jsonl")
This training data becomes your seed for recreating the model locally.
Step 4: Converting OpenAI JSONL Format for Open-Source Training
OpenAI’s fine-tuning format uses a chat messages structure. Most open-source training frameworks expect a slightly different format, so you will need a conversion step:
import json
def convert_openai_to_alpaca(input_path: str, output_path: str) -> None:
"""Convert OpenAI chat fine-tuning JSONL to Alpaca instruction format."""
converted = []
with open(input_path) as f:
for line in f:
example = json.loads(line.strip())
messages = example.get("messages", [])
system_msg = next((m["content"] for m in messages if m["role"] == "system"), "")
user_msg = next((m["content"] for m in messages if m["role"] == "user"), "")
assistant_msg = next((m["content"] for m in messages if m["role"] == "assistant"), "")
converted.append({
"instruction": user_msg,
"input": "",
"output": assistant_msg,
"system": system_msg
})
with open(output_path, "w") as f:
for item in converted:
f.write(json.dumps(item) + "\n")
print(f"Converted {len(converted)} examples to {output_path}")
convert_openai_to_alpaca("training_data.jsonl", "training_alpaca.jsonl")
Step 5: Recreating the Fine-Tuned Model Locally
With your training data extracted, you can now fine-tune an open-source model. The goal is to replicate the behavior of your OpenAI fine-tuned model as closely as possible using local infrastructure.
Choosing Your Base Model
Select a base model that matches the capabilities of your fine-tuned model:
-
Llama 2 7B — Good balance of capability and hardware requirements
-
Mistral 7B — Strong performance, efficient inference
-
Phi-2 — Smaller, faster, suitable for less complex tasks
For most use cases, Mistral 7B provides excellent results with reasonable hardware requirements. A single A100 GPU or even a high-end consumer GPU can handle inference.
Hardware Requirements by Model Size
| Model | Minimum VRAM (FP16) | Minimum VRAM (4-bit) | Training VRAM (LoRA) |
|---|---|---|---|
| Phi-2 (2.7B) | 6 GB | 3 GB | 10 GB |
| Mistral 7B | 16 GB | 6 GB | 20 GB |
| Llama 2 13B | 28 GB | 10 GB | 40 GB |
| Llama 2 70B | 140 GB | 40 GB | 80 GB (multi-GPU) |
Fine-Tuning Locally with LoRA
Low-Rank Adaptation (LoRA) lets you fine-tune efficiently without retraining the entire model:
# Install required packages
pip install transformers peft accelerate bitsandbytes
# Run fine-tuning with LoRA
python train_lora.py \
--model mistralai/Mistral-7B-v0.1 \
--data_path ./training_data.jsonl \
--output_dir ./local-finetuned-model \
--num_epochs 3 \
--batch_size 4 \
--learning_rate 2e-4 \
--lora_r 16 \
--lora_alpha 32
Here is a basic training script:
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
# Load base model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
"mistralai/Mistral-7B-v0.1",
load_in_8bit=True,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
# Configure LoRA
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
# Apply LoRA to model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Load your training data
dataset = load_dataset("json", data_files="training_data.jsonl")
# Training arguments
training_args = TrainingArguments(
output_dir="./local-finetuned-model",
num_train_epochs=3,
per_device_train_batch_size=4,
learning_rate=2e-4,
logging_steps=10,
save_strategy="epoch",
)
# Train (simplified—you would use a proper trainer in production)
# This demonstrates the concept
print("Ready to fine-tune locally")
Step 6: Run Inference Locally
After fine-tuning completes, you can run inference without any external API calls:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
"mistralai/Mistral-7B-v0.1",
load_in_8bit=True,
device_map="auto"
)
# Load fine-tuned weights
model = PeftModel.from_pretrained(
base_model,
"./local-finetuned-model",
adapter_name="fine-tuned"
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
# Run inference
def generate(prompt, max_tokens=256):
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=max_tokens,
temperature=0.7,
top_p=0.9
)
return tokenizer.decode(outputs[0], skip_special_tokens=True)
# Test your local model
response = generate("Your prompt here")
print(response)
Step 7: Serving the Model as a Local API
Once you have a working local model, you often want to expose it with an OpenAI-compatible API so that existing code that calls openai.chat.completions.create works without modification. The vllm library makes this straightforward:
pip install vllm
# Serve the merged model with an OpenAI-compatible endpoint
python -m vllm.entrypoints.openai.api_server \
--model ./local-finetuned-model-merged \
--host 0.0.0.0 \
--port 8000
Your application can then point to http://localhost:8000 instead of the OpenAI API, with zero code changes beyond the base URL.
Step 8: Optimizing for Production
For production deployment, consider these optimizations:
-
Quantization — Use 4-bit or 8-bit quantization to reduce memory requirements
-
vLLM — Deploy with vLLM for faster inference throughput
-
TensorRT-LLM — For maximum performance on NVIDIA hardware
# Example: Loading with 4-bit quantization using bitsandbytes
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype="float16",
bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
"mistralai/Mistral-7B-v0.1",
quantization_config=quantization_config,
device_map="auto"
)
Step 9: Limitations and Tradeoffs
You should know that recreating your fine-tuned model locally will not produce identical results. Differences include:
-
Base model variations between OpenAI’s and open-source models
-
Training dynamics and hyperparameters differ from OpenAI’s internal process
-
Dataset formatting and preprocessing may vary
-
OpenAI applies RLHF and safety tuning on top of SFT, which you will not replicate
Test thoroughly to ensure the local model meets your quality requirements before migrating production workloads. A practical evaluation strategy: run 50-100 representative prompts through both the OpenAI fine-tuned model and your local reproduction, then score the outputs on your specific quality criteria. Expect 80-90% behavioral parity as a realistic target; achieving exact parity is not possible without access to OpenAI’s internal training pipeline.
Cost Comparison: OpenAI API vs Self-Hosted
One of the primary motivations for local deployment is cost reduction at scale. Here is how the economics typically compare:
| Scenario | OpenAI Fine-Tuned (gpt-3.5-turbo) | Self-Hosted Mistral 7B |
|---|---|---|
| Monthly inference volume | 10M tokens | 10M tokens |
| API cost (input + output) | ~$120/month | $0 (after hardware) |
| Hardware cost (A10G cloud) | — | ~$300/month |
| Break-even volume | — | ~25M tokens/month |
| Latency (P50) | 200-400ms | 50-150ms (local GPU) |
| Uptime guarantee | 99.9% SLA | Self-managed |
At volumes above roughly 25M tokens per month, self-hosting becomes economically favorable. Below that threshold, the simplicity and reliability of OpenAI’s API usually wins. The latency advantage of local inference is significant for user-facing applications regardless of cost.
Step 10: Common Pitfalls to Avoid
Forgetting to merge LoRA adapters before serving. LoRA adapters are stored separately from base model weights. For production serving with vLLM, merge them first:
from peft import PeftModel
import torch
base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base_model, "./local-finetuned-model")
merged = model.merge_and_unload()
merged.save_pretrained("./local-finetuned-model-merged")
tokenizer.save_pretrained("./local-finetuned-model-merged")
Using mismatched tokenizers. Always load the tokenizer from the same base model checkpoint. Mismatched tokenizers produce garbled outputs that are hard to diagnose.
Skipping chat template formatting. Mistral and Llama models require specific chat templates to produce coherent conversational outputs. Pass messages through the tokenizer’s apply_chat_template method rather than formatting prompts manually.
Troubleshooting
Configuration changes not taking effect
Restart the relevant service or application after making changes. Some settings require a full system reboot. Verify the configuration file path is correct and the syntax is valid.
Permission denied errors
Run the command with sudo for system-level operations, or check that your user account has the necessary permissions. On macOS, you may need to grant terminal access in System Settings > Privacy & Security.
Connection or network-related failures
Check your internet connection and firewall settings. If using a VPN, try disconnecting temporarily to isolate the issue. Verify that the target server or service is accessible from your network.
Related Reading
- ChatGPT API Fine Tuning Costs Training Plus Inference Total
- How to Set Up Model Context Protocol with Local Database
- Export Perplexity Collections Before Switching to ChatGPT Se
- How to Export ChatGPT Shared Links Before Account Deletion
- Gemini Flash vs Pro API Pricing: When to Use Which Model
Frequently Asked Questions
How long does it take to export chatgpt api fine-tuned model for local?
For a straightforward setup, expect 30 minutes to 2 hours depending on your familiarity with the tools involved. Complex configurations with custom requirements may take longer. Having your credentials and environment ready before starting saves significant time.
What are the most common mistakes to avoid?
The most frequent issues are skipping prerequisite steps, using outdated package versions, and not reading error messages carefully. Follow the steps in order, verify each one works before moving on, and check the official documentation if something behaves unexpectedly.
Do I need prior experience to follow this guide?
Basic familiarity with the relevant tools and command line is helpful but not strictly required. Each step is explained with context. If you get stuck, the official documentation for each tool covers fundamentals that may fill in knowledge gaps.
Will this work with my existing CI/CD pipeline?
The core concepts apply across most CI/CD platforms, though specific syntax and configuration differ. You may need to adapt file paths, environment variable names, and trigger conditions to match your pipeline tool. The underlying workflow logic stays the same.
Where can I get help if I run into issues?
Start with the official documentation for each tool mentioned. Stack Overflow and GitHub Issues are good next steps for specific error messages. Community forums and Discord servers for the relevant tools often have active members who can help with setup problems.
Related Articles
- Ollama vs LM Studio for Local Model Serving
- ChatGPT API Fine Tuning Costs Training Plus Inference Total
- How to Create Model Context Protocol Server That Serves API
- ChatGPT API 429 Too Many Requests
- Cursor AI Model Selection Guide Which Model for Which Coding
Built by theluckystrike — More at zovo.one