Reasoning models are the most powerful — and the most deceptively priced — AI models available in 2026. The sticker price per million tokens only tells half the story. The real cost driver is thinking tokens: the internal chain-of-thought that reasoning models generate before producing your visible answer. These tokens are billed as output but never appear in the response.
Here's everything you need to know about reasoning model pricing across providers, thinking token overhead, and when the premium is actually justified.
📊 Stat: A single reasoning model request can cost 5–14× more than the same request on a standard model, due to thinking token overhead.
What are thinking tokens?
When you send a prompt to a reasoning model like OpenAI's o3 or o4-mini, the model doesn't jump straight to an answer. It first generates an internal chain of reasoning — sometimes hundreds or thousands of tokens — working through the problem step by step.
These thinking tokens are generated as output tokens, which means they're billed at the (higher) output token rate. You don't see them in the response, but they show up on your bill.
The volume of thinking tokens depends entirely on the problem's complexity:
- Simple question (factual lookup, basic classification): 200–500 thinking tokens
- Moderate reasoning (code generation, multi-step analysis): 2,000–5,000 thinking tokens
- Complex problem (mathematical proofs, architectural design, constraint satisfaction): 5,000–20,000 thinking tokens
- Extremely hard problem (competition math, novel algorithm design): 20,000–50,000+ thinking tokens
This unpredictability is what makes reasoning model budgeting difficult. A request that looks like a simple query can trigger deep reasoning and consume 10× more output tokens than expected.
⚠️ Warning: Thinking tokens are invisible in the response but fully visible on your invoice. A response with only 500 visible tokens is billed as 10,000 tokens if the model generated 9,500 thinking tokens internally. Always check actual token usage in your API response metadata.
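As a concrete sketch of that check, the helper below derives the effective cost of a request from the usage metadata the API returns. Field names follow OpenAI's usage schema (reasoning tokens are reported under completion_tokens_details.reasoning_tokens and are already included in completion_tokens); prices are passed in as USD per million tokens.

```python
# Sketch: effective cost per request from API usage metadata.
def request_cost(usage: dict, input_price: float, output_price: float) -> dict:
    """Prices are USD per 1M tokens."""
    prompt = usage["prompt_tokens"]
    completion = usage["completion_tokens"]  # visible + thinking, billed together
    thinking = usage.get("completion_tokens_details", {}).get("reasoning_tokens", 0)
    cost = prompt * input_price / 1e6 + completion * output_price / 1e6
    return {
        "visible_tokens": completion - thinking,
        "thinking_tokens": thinking,
        "thinking_share": thinking / completion if completion else 0.0,
        "cost_usd": round(cost, 6),
    }

# The warning's example: 500 visible tokens, 9,500 thinking tokens, o3 pricing
usage = {
    "prompt_tokens": 1_000,
    "completion_tokens": 10_000,
    "completion_tokens_details": {"reasoning_tokens": 9_500},
}
print(request_cost(usage, input_price=2.00, output_price=8.00))
```

Logging this per request makes the thinking-token share visible long before the invoice arrives.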
Reasoning model pricing at a glance
Here are the major reasoning models currently available, with per-million-token pricing:
| Model | Provider | Input (per 1M) | Output (per 1M) | Context Window | Notes |
|---|---|---|---|---|---|
| GPT-5.2 pro | OpenAI | $21.00 | $168.00 | 1M | Most expensive; highest capability |
| o3-pro | OpenAI | $20.00 | $80.00 | 1M | Premium reasoning |
| o1 | OpenAI | $15.00 | $60.00 | 200K | Original reasoning model |
| Grok 4 | xAI | $3.00 | $15.00 | 256K | Vision + reasoning |
| Magistral Medium | Mistral | $2.00 | $5.00 | 128K | Transparent reasoning |
| o3 | OpenAI | $2.00 | $8.00 | 1M | Advanced reasoning |
| o4-mini | OpenAI | $1.10 | $4.40 | 2M | Efficient reasoning, huge context |
| o3-mini | OpenAI | $1.10 | $4.40 | 500K | Previous-gen efficient reasoning |
| o1-mini | OpenAI | $1.10 | $4.40 | 128K | Original compact reasoning |
| Magistral Small | Mistral | $0.50 | $1.50 | 128K | Budget reasoning |
| DeepSeek R1 V3.2 | DeepSeek | $0.28 | $0.42 | 128K | Cheapest reasoning model |
For comparison, here are the equivalent non-reasoning models:
| Model | Provider | Input (per 1M) | Output (per 1M) |
|---|---|---|---|
| GPT-5.2 | OpenAI | $1.75 | $14.00 |
| GPT-5 | OpenAI | $1.25 | $10.00 |
| Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 |
| GPT-5 mini | OpenAI | $0.25 | $2.00 |
| DeepSeek V3.2 | DeepSeek | $0.28 | $0.42 |
At sticker price, o3 ($2/$8) and GPT-5 ($1.25/$10) look similarly priced. But that comparison is misleading — o3 generates thinking tokens on top of your visible output, making the effective cost per request 5–14× higher.
The real cost: thinking token multiplier in action
Let's work through concrete examples. You ask a coding question with a 1,000-token prompt and expect a 500-token visible answer.
With GPT-5 (no thinking tokens):
- Input: 1,000 tokens × $1.25/1M = $0.00125
- Output: 500 tokens × $10.00/1M = $0.005
- Total: $0.00625 per request
With o3 (moderate reasoning — ~3,000 thinking tokens):
- Input: 1,000 tokens × $2.00/1M = $0.002
- Output: 3,500 tokens (500 visible + 3,000 thinking) × $8.00/1M = $0.028
- Total: $0.030 per request — 4.8× more expensive than GPT-5
With o3 (heavy reasoning — ~10,000 thinking tokens):
- Input: 1,000 tokens × $2.00/1M = $0.002
- Output: 10,500 tokens × $8.00/1M = $0.084
- Total: $0.086 per request — 13.8× more expensive
With o3-pro (heavy reasoning — ~10,000 thinking tokens):
- Input: 1,000 tokens × $20.00/1M = $0.02
- Output: 10,500 tokens × $80.00/1M = $0.84
- Total: $0.86 per request — 138× more expensive than GPT-5
📊 Quick Math: A single complex reasoning request on o3-pro can cost $0.86 — more than many developers spend on AI in an entire day. Before using premium reasoning models, verify that the accuracy improvement justifies a 50–140× cost increase over standard models.
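The arithmetic behind these examples can be sketched in a few lines (prices in USD per million tokens, taken from the pricing tables above):

```python
# Sketch of the per-request arithmetic above.
def cost(in_tok, vis_tok, think_tok, in_price, out_price):
    """Thinking tokens are billed at the output rate alongside visible tokens."""
    return in_tok * in_price / 1e6 + (vis_tok + think_tok) * out_price / 1e6

gpt5   = cost(1_000, 500,      0,  1.25, 10.00)  # $0.00625
o3_mod = cost(1_000, 500,  3_000,  2.00,  8.00)  # $0.030
o3_hvy = cost(1_000, 500, 10_000,  2.00,  8.00)  # $0.086
o3_pro = cost(1_000, 500, 10_000, 20.00, 80.00)  # $0.86

print(f"{o3_mod / gpt5:.1f}x, {o3_hvy / gpt5:.1f}x, {o3_pro / gpt5:.0f}x")
# 4.8x, 13.8x, 138x
```

Swapping in your own token counts and prices gives the effective multiplier for your workload.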
Monthly cost comparison: production workloads
For a production workload of 10,000 requests per day (typical for a SaaS backend), here's what you'd spend monthly at different reasoning intensities:
| Model | Avg Thinking Tokens | Cost/Request | Monthly Cost |
|---|---|---|---|
| DeepSeek V3.2 (standard) | 0 | $0.00049 | $147 |
| GPT-5 mini | 0 | $0.00125 | $375 |
| GPT-5 | 0 | $0.00625 | $1,875 |
| DeepSeek R1 V3.2 | ~2,000 | $0.00133 | $399 |
| o4-mini | ~2,000 | $0.01210 | $3,630 |
| Magistral Small | ~2,000 | $0.00425 | $1,275 |
| o3 | ~3,000 | $0.03000 | $9,000 |
| Magistral Medium | ~3,000 | $0.01950 | $5,850 |
| Grok 4 | ~3,000 | $0.05550 | $16,650 |
| o3-pro | ~5,000 | $0.46000 | $138,000 |
| GPT-5.2 pro | ~5,000 | $0.94500 | $283,500 |
DeepSeek R1 V3.2 stands out as remarkably cost-effective for a reasoning model. At $0.28/$0.42 per million tokens, even with 2,000 thinking tokens per request, it costs just $399/month — comparable to GPT-5 mini without reasoning. It's the only reasoning model that can compete on price with standard models in most budget model rankings.
💡 Key Takeaway: DeepSeek R1 V3.2 is the budget reasoning powerhouse. At $399/month for 10K daily requests with moderate reasoning, it costs 96% less than o3 ($9,000) and 99.7% less than o3-pro ($138,000) for the same workload. If you need reasoning on a budget, start here.
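A minimal sketch reproducing the monthly figures above, assuming the same workload (10,000 requests/day, 30-day month, 1,000 input and 500 visible output tokens per request):

```python
# Sketch: monthly cost for a fixed workload, prices in USD per 1M tokens.
def monthly(in_price, out_price, thinking, req_per_day=10_000, days=30):
    per_request = 1_000 * in_price / 1e6 + (500 + thinking) * out_price / 1e6
    return per_request * req_per_day * days

print(round(monthly(0.28, 0.42, 2_000)))    # DeepSeek R1 V3.2 -> 399
print(round(monthly(2.00, 8.00, 3_000)))    # o3 -> 9000
print(round(monthly(20.00, 80.00, 5_000)))  # o3-pro -> 138000
```

Adjust the thinking-token argument to your measured averages; it dominates every other variable.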
When reasoning models are worth the premium
Reasoning models aren't just "better GPT." They excel at specific tasks where step-by-step logical thinking produces measurably better results.
Worth the premium (accuracy improvements justify cost):
- Complex code generation and debugging — reasoning catches edge cases, handles multi-file dependencies, and produces more correct code on the first attempt
- Multi-step mathematical reasoning — standard models often fail at 3+ step problems where reasoning models maintain accuracy
- Logic puzzles and constraint satisfaction — scheduling, optimization, and rule-based problems
- Scientific analysis requiring careful deduction and evidence evaluation
- Legal and medical reasoning where errors have real consequences
- Agentic workflows where the model needs to plan and execute multi-step tasks
Not worth the premium (standard models perform equally well):
- Simple Q&A or chatbot conversations
- Text summarization (reasoning overhead adds cost without improving quality)
- Translation (language tasks don't benefit from chain-of-thought)
- Content generation (creative writing, marketing copy)
- Classification tasks (labels don't need reasoning)
- Data extraction and formatting
The accuracy test: If your accuracy on a task improves from 70% to 95% with a reasoning model, and errors cost you money (wrong code, bad analysis, incorrect recommendations), the 5–14× price increase easily pays for itself. If accuracy only improves from 90% to 92%, the premium rarely justifies the cost.
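The accuracy test can be made concrete with a simple break-even check. The numbers below are illustrative, not benchmarks: an upgrade pays off when the error-rate reduction times the cost of an error exceeds the added per-request model cost.

```python
# Sketch: break-even check for the accuracy test above (illustrative numbers).
def upgrade_pays_off(acc_std, acc_reason, cost_std, cost_reason, error_cost):
    saved = (acc_reason - acc_std) * error_cost  # expected savings per request
    premium = cost_reason - cost_std             # added model cost per request
    return saved > premium

# 70% -> 95% accuracy, errors cost $2 each to fix, GPT-5 vs o3 per-request cost:
print(upgrade_pays_off(0.70, 0.95, 0.00625, 0.030, error_cost=2.00))  # True
# 90% -> 92%, errors cost $0.10 each: the premium rarely pays off:
print(upgrade_pays_off(0.90, 0.92, 0.00625, 0.030, error_cost=0.10))  # False
```

The hard part in practice is estimating error_cost honestly; once you have it, the decision is arithmetic.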
Five strategies to control reasoning model costs
1. Use reasoning effort settings
OpenAI's o-series models support a reasoning_effort parameter with three levels: low, medium, and high. Lower effort = fewer thinking tokens = lower cost.
| Effort Level | Typical Thinking Tokens | Relative Cost |
|---|---|---|
| Low | 500–1,000 | 1× (baseline) |
| Medium | 2,000–5,000 | 3–5× |
| High | 5,000–20,000 | 10–20× |
For many tasks, medium gives 80% of high's quality at 40% of the thinking token cost. Start with medium and only escalate to high for problems that demonstrably benefit.
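A rough sketch of how effort level translates into output cost, using the midpoints of the thinking-token ranges in the table above and o3's $8/1M output rate (the midpoints are estimates, not guarantees):

```python
# Sketch: projected output cost per request at each reasoning_effort level.
EFFORT_THINKING = {"low": 750, "medium": 3_500, "high": 12_500}  # midpoint estimates

def output_cost(effort, visible=500, out_price=8.00):  # o3's $8/1M output rate
    return (visible + EFFORT_THINKING[effort]) * out_price / 1e6

for effort in ("low", "medium", "high"):
    print(f"{effort:>6}: ${output_cost(effort):.4f}")
```

Running the same projection with your own model's output rate shows what escalating from medium to high actually costs per request.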
2. Route by complexity
Don't send every request to a reasoning model. Use a cheap model (GPT-5 nano at $0.05/$0.40 or Mistral Small 3.2 at $0.06/$0.18) as a router to classify request difficulty. Only escalate complex requests to reasoning models.
Typical distribution for a coding assistant:
- 60% simple requests → GPT-5 mini ($0.25/$2.00)
- 30% moderate → DeepSeek R1 V3.2 ($0.28/$0.42)
- 10% complex → o3 ($2.00/$8.00)
This routing approach cuts reasoning model costs by 70–90% compared to sending everything to o3. Read our cost optimization guide for implementation details.
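A minimal sketch of this routing pattern. The keyword heuristic and model names are illustrative stand-ins: in production the classifier would itself be a cheap model call, not string matching.

```python
# Sketch of complexity routing; tiers mirror the distribution above.
TIERS = {"simple": "gpt-5-mini", "moderate": "deepseek-r1", "complex": "o3"}

def classify(prompt: str) -> str:
    """Toy heuristic standing in for a cheap classifier-model call."""
    hard_signals = ("prove", "optimize", "deadlock", "architecture")
    if any(word in prompt.lower() for word in hard_signals):
        return "complex"
    return "moderate" if len(prompt) > 400 else "simple"

def route(prompt: str) -> str:
    return TIERS[classify(prompt)]

print(route("What does HTTP 418 mean?"))          # gpt-5-mini
print(route("Prove this loop invariant holds."))  # o3
```

The key design choice is that misrouting downward is cheap to detect (retry on the bigger model) while misrouting upward silently burns budget, so bias the classifier toward the cheaper tier.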
3. Set max completion tokens
Cap your output tokens to prevent runaway thinking. If a task should take 500 tokens to answer, setting max_completion_tokens to 5,000 prevents the model from spending 50,000 tokens reasoning about edge cases.
This is especially important for o3-pro and GPT-5.2 pro, where uncapped thinking on a complex problem can generate $1+ per request. Hard limits protect your budget.
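A small guard like the one below bounds worst-case spend before a request is sent. Because max_completion_tokens caps visible and thinking tokens together, it also caps the output bill.

```python
# Sketch: worst-case cost bound for a capped request, prices in USD per 1M tokens.
def worst_case_cost(in_tokens, max_completion_tokens, in_price, out_price):
    """The cap covers visible + thinking output, so this is a hard upper bound."""
    return in_tokens * in_price / 1e6 + max_completion_tokens * out_price / 1e6

# o3-pro pricing: a 5,000-token cap bounds any single request at $0.42
print(f"${worst_case_cost(1_000, 5_000, 20.00, 80.00):.2f}")  # $0.42
```

Checking this bound against a per-request budget before dispatch turns "a runaway request cost $1+" into a rejected call.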
4. Consider DeepSeek R1 V3.2
At $0.28/$0.42 per million tokens, DeepSeek R1 V3.2 offers chain-of-thought reasoning at standard-model prices. For many use cases — code generation, math, logic problems — it delivers reasoning capability at a fraction of o3's cost. The tradeoff: smaller context window (128K vs o3's 1M) and less polish on edge cases.
5. Monitor thinking token usage
Track actual thinking token counts per request type. OpenAI's API returns thinking token counts in the usage metadata. Log this data and analyze it weekly:
- Are certain prompt patterns triggering excessive thinking?
- Can you rephrase prompts to reduce reasoning depth?
- Are there request types where thinking tokens add no measurable quality?
Use this data to continuously refine your routing rules and effort settings.
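A sketch of that weekly roll-up, assuming you log one row per request with the reasoning-token count from the usage metadata (the log schema here is hypothetical):

```python
# Sketch: weekly roll-up of logged usage by request type.
from collections import defaultdict

def thinking_report(log_rows):
    """log_rows: iterable of (request_type, thinking_tokens, total_output_tokens)."""
    totals = defaultdict(lambda: [0, 0])
    for req_type, thinking, total_out in log_rows:
        totals[req_type][0] += thinking
        totals[req_type][1] += total_out
    # share of output tokens spent on thinking, per request type
    return {t: round(think / out, 2) for t, (think, out) in totals.items()}

rows = [
    ("summarize", 1_800, 2_300),
    ("summarize", 2_200, 2_700),
    ("codegen",     900, 1_900),
]
print(thinking_report(rows))  # {'summarize': 0.8, 'codegen': 0.47}
```

A request type where 80% of output tokens are thinking, with no quality gain over a standard model, is an obvious candidate for downgrading in your routing rules.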
⚠️ Warning: Reasoning model costs are inherently unpredictable because thinking token volume varies by problem difficulty. Always set hard spending caps with your provider and max_completion_tokens on every request. A single runaway request on o3-pro can cost more than your entire daily budget.
Reasoning model comparison: cost efficiency ranking
For a standardized workload (1,000 input tokens, 500 visible output tokens, 3,000 thinking tokens), here's how every reasoning model compares on cost per request:
| Model | Cost/Request | Relative Cost |
|---|---|---|
| DeepSeek R1 V3.2 | $0.0018 | 1× (baseline) |
| Magistral Small | $0.0058 | 3.2× |
| o4-mini | $0.0165 | 9.2× |
| o3-mini | $0.0165 | 9.2× |
| Magistral Medium | $0.0195 | 10.8× |
| o3 | $0.0300 | 16.7× |
| Grok 4 | $0.0555 | 30.8× |
| o1 | $0.2250 | 125× |
| o3-pro | $0.3000 | 167× |
| GPT-5.2 pro | $0.6090 | 338× |
Per reasoning request, DeepSeek R1 costs 1/338th as much as GPT-5.2 pro. Even compared to o3 — the most commonly used production reasoning model — it's 16.7× cheaper, which mirrors what we see in our DeepSeek vs GPT-5 mini breakdown.
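The per-request costs in this ranking can be reproduced directly; the DeepSeek R1 baseline works out to $0.00175, which the table rounds to $0.0018.

```python
# Sketch reproducing the ranking's per-request costs (1,000 input tokens,
# 500 visible + 3,000 thinking output tokens; prices in USD per 1M).
def per_request(in_price, out_price, in_tok=1_000, out_tok=3_500):
    return in_tok * in_price / 1e6 + out_tok * out_price / 1e6

for name, inp, outp in [
    ("DeepSeek R1 V3.2",  0.28,   0.42),
    ("o3",                2.00,   8.00),
    ("GPT-5.2 pro",      21.00, 168.00),
]:
    print(f"{name}: ${per_request(inp, outp):.5f}")
```

Because output pricing dominates once thinking tokens are included, the ranking tracks output rates far more closely than input rates.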
The bottom line
Reasoning models are powerful but expensive — not because of their sticker price, but because of the hidden thinking token overhead. Before choosing a reasoning model:
- Estimate your thinking token ratio. Test with real prompts and check actual token usage in the API response.
- Compare total cost, not just per-token price. A "cheaper" reasoning model that generates more thinking tokens can end up costing more than an "expensive" one that thinks less.
- Route smartly. Use reasoning models only where they add measurable value. Send everything else to standard models.
- Start with DeepSeek R1. At $0.28/$0.42, it's the cheapest way to access reasoning capabilities. Escalate to o3 or o4-mini only when DeepSeek R1 falls short.
- Track your spend. Monitor thinking token usage weekly — it varies by prompt and can creep up.
Use our calculator to estimate monthly costs with thinking token overhead, or check the comparison pages to see how reasoning models stack up against standard models for your specific use case.
Frequently asked questions
Do all reasoning models charge for thinking tokens?
Yes — thinking tokens are billed as output tokens across all providers. The impact varies enormously by pricing: DeepSeek R1 charges $0.42/M for thinking tokens (barely noticeable), while o3-pro charges $80/M (budget-breaking). Always calculate your effective cost including estimated thinking token volume.
Can I see the thinking tokens in the API response?
OpenAI's API returns the thinking token count in the usage metadata (completion_tokens_details.reasoning_tokens), but not the actual content. You can see how many tokens were used for reasoning versus the visible response. This data is essential for cost monitoring and optimization.
How many thinking tokens does a typical request use?
It varies enormously by problem complexity. Simple tasks: 200–500 thinking tokens. Moderate reasoning: 2,000–5,000. Complex problems: 5,000–20,000+. Competition math problems can generate 50,000+ thinking tokens for a single response. The unpredictability is why spending caps and monitoring are non-negotiable.
Is o4-mini better than o3-mini?
Yes. o4-mini is the successor to o3-mini at the same price point ($1.10/$4.40). It offers improved reasoning capability and a larger 2M context window (versus o3-mini's 500K). There's no reason to use o3-mini for new projects — o4-mini is strictly better at the same price.
When should I use DeepSeek R1 versus o3?
Start with DeepSeek R1 V3.2 for all reasoning tasks. It costs 16.7× less than o3 per request. Only escalate to o3 when: (1) you need >128K context, (2) DeepSeek R1's accuracy measurably falls short on your specific task, or (3) you need OpenAI's ecosystem features (function calling format, specific API guarantees). For most coding and reasoning tasks, DeepSeek R1 delivers comparable quality at a fraction of the cost.
