Last updated: March 16, 2026
Claude API bills extended thinking tokens at the standard output token rate. When extended thinking is enabled, the model generates internal reasoning tokens before producing its final answer; these count toward your output token total even though, depending on the model version, they may be returned only as separate (and possibly summarized) thinking blocks rather than as part of the visible answer. Your output token cost therefore scales with the complexity of the reasoning task. Below is a practical breakdown of the billing mechanism, cost implications, and strategies to optimize your spending.
Table of Contents
- What Is Extended Thinking in Claude API
- How Output Tokens Are Billed
- Current Pricing Structure
- Practical Code Examples
- Understanding Token Usage in Responses
- How Much Do Reasoning Tokens Add?
- Cost Optimization Strategies
- Monitoring Your Spending
- Building a Cost-Aware Routing Layer
- Step-by-Step Workflow: Benchmarking Extended Thinking for Your Use Case
What Is Extended Thinking in Claude API
Extended thinking is a feature that allows Claude models to engage in deeper reasoning before producing their final response. When enabled, the model breaks down complex problems, explores multiple approaches, and reasons through its solution before delivering the actual output. This results in more thoughtful, accurate responses for tasks that require complex reasoning, coding, or analysis.
The feature works by having the model generate reasoning tokens that are billed as output tokens even though they are not part of the final answer. Depending on the model version, this reasoning may be returned as separate thinking blocks in the response, sometimes in summarized form, but it is always processed and billed at the standard output rate. These reasoning tokens represent the model’s “thought process” as it works through your request.
How Output Tokens Are Billed
When using the Claude API, you’re charged based on the number of tokens processed—both input tokens (what you send) and output tokens (what the model generates). Extended thinking specifically affects output token billing because the reasoning process generates additional tokens beyond the visible response.
Here’s the basic cost structure:
- Input tokens: charged at the model’s input rate (quoted per million tokens)
- Output tokens: charged at the model’s output rate (quoted per million tokens)
Extended thinking increases output tokens because the model generates reasoning tokens before producing its final answer. These reasoning tokens count toward your output token limit and are billed at the standard output token rate.
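As a quick sanity check, the billing math above can be captured in a small helper. This is a sketch using Claude 3.5 Sonnet’s published rates ($3.00 input / $15.00 output per million tokens) as defaults; substitute your model’s actual rates.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate_per_m: float = 3.00,
                 output_rate_per_m: float = 15.00) -> float:
    """Estimate the cost of one request in USD.

    output_tokens must include any extended thinking tokens,
    since those are billed at the output rate.
    """
    return (input_tokens / 1_000_000 * input_rate_per_m
            + output_tokens / 1_000_000 * output_rate_per_m)

# A request with 500 input tokens and 2,000 output tokens
# (including reasoning tokens) at Sonnet rates:
print(round(request_cost(500, 2_000), 6))  # → 0.0315
```

The same function works for any model: pass the matching per-million rates from the pricing table below.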
Current Pricing Structure
Claude API pricing varies by model. Here’s a general breakdown for the main models:
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Claude 3.5 Haiku | $0.80 | $4.00 |
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude 3 Opus | $15.00 | $75.00 |
When you enable extended thinking, the output tokens include both the visible response and the internal reasoning tokens. The exact number of reasoning tokens depends on the complexity of your request.
Practical Code Examples
Here’s how to enable extended thinking in your API calls. It is turned on with the thinking parameter; budget_tokens caps the reasoning tokens and must be less than max_tokens:

import anthropic

client = anthropic.Anthropic(api_key="your-api-key")

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    thinking={
        "type": "enabled",
        "budget_tokens": 2048,  # upper bound on reasoning tokens
    },
    messages=[
        {"role": "user", "content": "Explain how quicksort works and implement it in Python"}
    ],
)

print(message.content)
print(f"Usage: {message.usage}")
In this example, message.usage will show both input and output tokens, including any reasoning tokens generated by extended thinking.
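With extended thinking enabled, the response content is a list of blocks: reasoning arrives in blocks of type "thinking" and the visible answer in "text" blocks. A minimal sketch of splitting the two, shown over plain dicts for illustration (the SDK returns typed objects exposing the same type field):

```python
def split_blocks(content):
    """Separate thinking blocks from visible text blocks."""
    thinking = [b for b in content if b.get("type") == "thinking"]
    text = [b for b in content if b.get("type") == "text"]
    return thinking, text

# Illustrative response content:
content = [
    {"type": "thinking", "thinking": "First, recall how partitioning works..."},
    {"type": "text", "text": "Quicksort works by repeatedly partitioning..."},
]
thinking, text = split_blocks(content)
print(len(thinking), len(text))  # → 1 1
```

This matters in application code: the first content block is no longer guaranteed to be the answer once thinking is enabled.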
For Node.js applications, the equivalent code looks like this:
import Anthropic from '@anthropic-ai/sdk';
const anthropic = new Anthropic({
apiKey: process.env.ANTHROPIC_API_KEY,
});
const message = await anthropic.messages.create({
model: 'claude-sonnet-4-20250514',
max_tokens: 4096,
extra_headers: {
'anthropic-beta': 'extended-thinking-2025-05-01',
},
messages: [
{
role: 'user',
content: 'Explain how quicksort works and implement it in Python'
}
],
});
console.log(message.content);
console.log(message.usage);
Understanding Token Usage in Responses
To understand exactly how extended thinking affects your billing, examine the usage object in the API response:
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    thinking={"type": "enabled", "budget_tokens": 2048},
    messages=[{"role": "user", "content": "Solve this complex problem..."}],
)

# Usage breakdown
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
# Extended thinking tokens are included in output_tokens
The output_tokens field includes both the visible response tokens and the reasoning tokens. You can estimate the reasoning token count by comparing output counts between requests with and without extended thinking for similar queries.
How Much Do Reasoning Tokens Add?
The number of reasoning tokens varies significantly based on task complexity. Simple questions generate few or no reasoning tokens. Complex multi-step problems can generate hundreds or thousands of reasoning tokens before the model delivers its answer.
Here are approximate ranges based on task type, using Claude 3.5 Sonnet pricing at $15.00 per million output tokens:
| Task Type | Approx Reasoning Tokens | Extra Cost per Request |
|---|---|---|
| Simple factual lookup | 0–50 | ~$0.0008 |
| Basic code generation | 100–300 | ~$0.003 |
| Algorithm design | 500–1,500 | ~$0.015 |
| Complex debugging | 1,000–3,000 | ~$0.03 |
| System architecture | 2,000–6,000 | ~$0.06 |
These are estimates — actual token counts depend on the model version, your specific prompt, and how deep the reasoning chain runs. Monitor your actual usage with the tracking pattern described in the next section to build an accurate picture for your workload.
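The comparison approach described above can be written as a small helper: given output token counts for the same prompt with and without extended thinking, it estimates the reasoning overhead and its cost. The $15.00-per-million output rate is an example (Claude 3.5 Sonnet); the estimate assumes the visible answer length is roughly the same in both runs.

```python
def estimate_reasoning_overhead(output_with: int, output_without: int,
                                output_rate_per_m: float = 15.00):
    """Estimate reasoning tokens and their extra cost for one prompt.

    Treats the difference in output tokens between the two runs as an
    approximation of the reasoning tokens generated.
    """
    reasoning_tokens = max(output_with - output_without, 0)
    extra_cost = reasoning_tokens / 1_000_000 * output_rate_per_m
    return reasoning_tokens, extra_cost

tokens, cost = estimate_reasoning_overhead(2_400, 900)
print(tokens, round(cost, 4))  # → 1500 0.0225
```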
Cost Optimization Strategies
Managing costs with extended thinking requires careful consideration of when to use the feature:
Use extended thinking for:

- Complex coding problems requiring multi-step reasoning
- Mathematical and analytical tasks
- Architecture and system design questions
- Debugging and code review tasks

Skip extended thinking for:

- Simple queries and factual lookups
- Straightforward text generation
- Basic summarization tasks
- Quick translations
You can also control costs directly by capping the thinking budget alongside max_tokens (the budget must be less than max_tokens):

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    thinking={"type": "enabled", "budget_tokens": 1024},  # cap reasoning tokens
    messages=[{"role": "user", "content": "Your prompt here"}],
)
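When choosing a thinking budget, the API requires budget_tokens to be at least 1,024 and less than max_tokens. A small helper (a sketch; the 1,024 floor and the less-than-max constraint are the assumptions here) can pick a budget proportional to max_tokens:

```python
def thinking_budget(max_tokens: int, fraction: float = 0.5,
                    minimum: int = 1024) -> int:
    """Pick a thinking budget as a fraction of max_tokens.

    Clamps the result to at least `minimum` tokens and strictly
    less than max_tokens, per the API's constraints.
    """
    budget = int(max_tokens * fraction)
    budget = max(budget, minimum)
    return min(budget, max_tokens - 1)

print(thinking_budget(4096))        # → 2048
print(thinking_budget(2048, 0.25))  # → 1024
```

Note that very small max_tokens values (near or below the 1,024 floor) leave no valid budget at all, so this helper only makes sense for generous output limits.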
Monitoring Your Spending
Implement logging to track your extended thinking usage over time:

import time
import anthropic

client = anthropic.Anthropic(api_key="your-api-key")

def track_request(prompt, enable_extended_thinking=True):
    kwargs = {}
    if enable_extended_thinking:
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": 1024}
    start_time = time.time()
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
    duration = time.time() - start_time
    # Claude 3.5 Sonnet rates: $3.00 input / $15.00 output per million tokens
    cost = (response.usage.input_tokens / 1_000_000 * 3.00) + \
           (response.usage.output_tokens / 1_000_000 * 15.00)
    print(f"Tokens: {response.usage.input_tokens} in / {response.usage.output_tokens} out")
    print(f"Cost: ${cost:.4f}")
    print(f"Duration: {duration:.2f}s")
    return response, cost

# Test with extended thinking
response, cost = track_request("Explain the time complexity of merge sort")
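To watch spend across many requests rather than one at a time, the per-request numbers can feed a small accumulator. This is a sketch; the rates default to Claude 3.5 Sonnet’s $3.00/$15.00 per million tokens and are parameters, not assumptions baked in.

```python
class SpendTracker:
    """Accumulate token usage and estimated cost across requests."""

    def __init__(self, input_rate_per_m=3.00, output_rate_per_m=15.00):
        self.input_rate = input_rate_per_m
        self.output_rate = output_rate_per_m
        self.input_tokens = 0
        self.output_tokens = 0
        self.requests = 0

    def record(self, input_tokens: int, output_tokens: int) -> float:
        """Record one request; return its estimated cost in USD."""
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens
        self.requests += 1
        return (input_tokens / 1_000_000 * self.input_rate
                + output_tokens / 1_000_000 * self.output_rate)

    @property
    def total_cost(self) -> float:
        return (self.input_tokens / 1_000_000 * self.input_rate
                + self.output_tokens / 1_000_000 * self.output_rate)

tracker = SpendTracker()
tracker.record(500, 2_000)    # e.g. from response.usage
tracker.record(1_200, 4_500)
print(round(tracker.total_cost, 4))  # → 0.1026
```

In practice you would call tracker.record() with the input_tokens and output_tokens from each response’s usage object.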
Building a Cost-Aware Routing Layer
For applications that serve a mix of simple and complex queries, routing requests dynamically between models with and without extended thinking can significantly reduce costs without degrading quality on the queries that need it.
import anthropic

client = anthropic.Anthropic(api_key="your-api-key")

COMPLEX_KEYWORDS = [
    "implement", "design", "architecture", "debug", "optimize",
    "algorithm", "refactor", "explain how", "compare",
]

def classify_complexity(prompt: str) -> str:
    """Return 'complex' if the prompt warrants extended thinking."""
    prompt_lower = prompt.lower()
    if any(kw in prompt_lower for kw in COMPLEX_KEYWORDS):
        return "complex"
    if len(prompt.split()) > 50:
        return "complex"
    return "simple"

def smart_completion(prompt: str) -> dict:
    complexity = classify_complexity(prompt)
    use_extended = complexity == "complex"
    kwargs = {}
    if use_extended:
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": 1024}
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
    # With thinking enabled, the answer follows the thinking block(s),
    # so take the last text block rather than content[0].
    text_blocks = [b.text for b in response.content if b.type == "text"]
    return {
        "content": text_blocks[-1] if text_blocks else "",
        "extended_thinking_used": use_extended,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
    }
This pattern works well for internal tooling and developer assistants where you have some control over the query distribution. For customer-facing products, consider a fallback approach: start without extended thinking and retry with it only when the initial response fails a quality check.
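The fallback pattern described above can be sketched with the model call injected as a function, which keeps the retry logic testable. Here complete_fn and quality_check are placeholders for your own API wrapper and acceptance heuristic, not library functions.

```python
def completion_with_fallback(prompt, complete_fn, quality_check):
    """Try a cheap completion first; retry with extended thinking
    only if the first response fails the quality check.

    complete_fn(prompt, extended) -> str is your API wrapper;
    quality_check(text) -> bool is your acceptance heuristic.
    """
    first = complete_fn(prompt, extended=False)
    if quality_check(first):
        return first, False  # cheap path succeeded
    retry = complete_fn(prompt, extended=True)
    return retry, True       # fell back to extended thinking

# Stubbed example: the cheap path returns an empty answer,
# so the fallback fires.
def fake_complete(prompt, extended):
    return "detailed answer" if extended else ""

answer, used_extended = completion_with_fallback(
    "Design a rate limiter", fake_complete, lambda t: len(t) > 0)
print(used_extended)  # → True
```

The trade-off: failed quality checks cost you two requests, so this pattern pays off only when most queries pass on the cheap path.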
Step-by-Step Workflow: Benchmarking Extended Thinking for Your Use Case
Before committing to extended thinking across your application, run a structured benchmark to measure whether the quality improvement justifies the cost increase for your specific workload.
Step 1: Select representative prompts. Pick 20–30 prompts from your actual production logs that span the range of complexity in your workload. Include both simple and complex cases.
Step 2: Run each prompt twice. Send each prompt once with extended thinking enabled and once without. Use the same max_tokens value for both runs. Record the input tokens, output tokens, and response content for each.
Step 3: Score the responses. Use a simple rubric: correctness (0–3), completeness (0–3), and clarity (0–2). Have at least two reviewers score each response independently and average the scores.
Step 4: Calculate the cost delta. For each prompt pair, compute the extra cost from extended thinking: (extended_output_tokens - standard_output_tokens) / 1_000_000 * output_rate.
Step 5: Compute quality-per-dollar. Divide the quality improvement score by the extra cost for each prompt pair. This ratio tells you where extended thinking delivers the best return. Prompts with high ratios belong in the “always use extended thinking” bucket; prompts with ratios near zero or negative belong in the “never use” bucket.
Step 6: Implement tiered routing. Use the prompt characteristics you identified in step 5 to build a classifier (as shown in the routing example above) that assigns each incoming request to the appropriate tier.
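Steps 4 and 5 above reduce to a small calculation over the benchmark records. A sketch, where each record holds the quality scores and output token counts for one prompt pair; the $15.00-per-million output rate and the bucketing threshold are illustrative and should be tuned to your workload.

```python
def quality_per_dollar(records, output_rate_per_m=15.00, threshold=50.0):
    """Bucket prompts by the return on extended thinking.

    Each record: (quality_with, quality_without,
                  out_tokens_with, out_tokens_without).
    Returns (always_use, never_use) lists of record indices.
    """
    always, never = [], []
    for i, (q_with, q_without, t_with, t_without) in enumerate(records):
        extra_cost = max(t_with - t_without, 0) / 1_000_000 * output_rate_per_m
        gain = q_with - q_without
        ratio = gain / extra_cost if extra_cost > 0 else 0.0
        (always if ratio >= threshold else never).append(i)
    return always, never

records = [
    (7.5, 4.0, 3_000, 800),    # big quality gain for ~$0.033 extra
    (6.0, 6.0, 1_500, 1_400),  # no gain
]
always, never = quality_per_dollar(records)
print(always, never)  # → [0] [1]
```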
Frequently Asked Questions
Are reasoning tokens visible in the API response?
Partially. With extended thinking enabled, the reasoning is returned as separate thinking blocks in the response content, ahead of the visible answer. On some model versions this thinking is summarized, so the text shown may be shorter than the number of thinking tokens you are billed for; all thinking tokens count toward usage.output_tokens.
Does extended thinking always improve response quality? Not always. For simple questions, extended thinking adds tokens without meaningfully improving the answer. The feature provides the most benefit for tasks that genuinely require multi-step reasoning, such as algorithm design, complex debugging, or mathematical proofs.
Can I set a maximum budget for reasoning tokens?
Yes. The thinking parameter’s budget_tokens field caps the number of tokens the model may spend on reasoning. The budget must be less than max_tokens, which in turn bounds the combined reasoning-plus-response output.
How does extended thinking affect response latency? Extended thinking increases latency because the model must complete its reasoning process before generating the visible response. Expect latency to increase roughly in proportion to the number of reasoning tokens generated.
Is extended thinking available on all Claude models? Extended thinking availability varies by model version. Check the Anthropic API documentation for the current list of supported models and the required beta header values.
Should I use extended thinking in production by default? No. The recommended pattern is to enable it selectively based on query complexity. Using it by default will increase costs substantially for the majority of requests that do not benefit from extended reasoning.
Related Articles
- Claude API Tool Use Function Calling Pricing How Tokens Are
- How to Use Claude API Cheaply for Small Coding Projects
- ChatGPT API Assistants API Pricing Threads and Runs Cost
- Claude API vs OpenAI API Pricing Breakdown 2026
- Claude API Pay Per Token vs Pro Subscription Which Cheaper