Most AI pricing guides show you per-token rates. That's useful for prototyping. It's useless for production.
When you're processing a million customer support tickets, summarizing a million documents, or running AI on every transaction in your pipeline, the math changes dramatically. A $0.10 difference in per-million-token pricing can translate to tens of thousands of dollars per month at scale. The model that looked cheap in your proof-of-concept becomes a budget disaster at 1M requests.
This guide does the math nobody else does: exact costs for processing 1 million requests across every major AI provider in 2026, broken down by use case complexity, with the optimization strategies that actually matter at high volume.
[stat] $21 to $240,000+ The cost range for processing 1 million AI requests across the models in this guide, depending on model choice and task complexity
How we calculated these numbers
Every cost in this guide uses a consistent methodology based on real-world token usage patterns. We defined three request profiles that cover the majority of production AI workloads:
Simple requests (classification, sentiment, routing): ~200 input tokens, ~50 output tokens per request. Think email triage, content moderation, or intent detection.
Medium requests (summarization, Q&A, extraction): ~800 input tokens, ~300 output tokens per request. This covers customer support responses, document summarization, and structured data extraction.
Complex requests (analysis, generation, reasoning): ~2,000 input tokens, ~1,000 output tokens per request. Long-form content generation, code review, multi-step reasoning tasks.
All pricing comes directly from our calculator's model database, updated as of March 2026. No cached pricing, no guesswork.
💡 Key Takeaway: Token counts vary wildly by use case. A customer support bot averaging 300 output tokens per response costs 6x more than a classifier outputting 50 tokens. Profile your actual usage before forecasting.
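To make the arithmetic behind every table transparent, here's a minimal sketch of the formula: a per-million-token rate multiplied by per-request token counts, over 1 million requests. The function name is ours; the example rate matches the DeepSeek V3.2 pricing quoted later in this guide.

```python
def cost_per_million_requests(input_rate, output_rate, input_tokens, output_tokens):
    """USD for 1M requests, given $-per-1M-token rates and per-request token counts."""
    # 1M requests x N tokens/request = N million tokens, so the per-million
    # rate multiplies the per-request token count directly.
    return input_rate * input_tokens + output_rate * output_tokens

# Medium profile (800 in / 300 out) at $0.28/$0.42 per million tokens:
print(round(cost_per_million_requests(0.28, 0.42, 800, 300), 2))  # 350.0
```

Swap in your own rates and measured token averages to reproduce or extend any table below.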
The complete cost table: 1 million requests by model
Simple requests (200 input / 50 output tokens)
For lightweight tasks like classification, sentiment analysis, and routing, here's what 1M requests costs:
| Model | Input Cost | Output Cost | Total (1M requests) | Cost per request |
|---|---|---|---|---|
| Mistral Small 3.2 | $12.00 | $9.00 | $21.00 | $0.000021 |
| Gemini 2.0 Flash-Lite | $14.00 | $15.00 | $29.00 | $0.000029 |
| GPT-5 nano | $10.00 | $20.00 | $30.00 | $0.00003 |
| Gemini 2.0 Flash | $20.00 | $20.00 | $40.00 | $0.00004 |
| GPT-4.1 nano | $20.00 | $20.00 | $40.00 | $0.00004 |
| DeepSeek V3.2 | $56.00 | $21.00 | $77.00 | $0.000077 |
| GPT-5.4 nano | $40.00 | $62.50 | $102.50 | $0.0001025 |
| GPT-5.4 mini | $150.00 | $225.00 | $375.00 | $0.000375 |
| Claude Haiku 4.5 | $200.00 | $250.00 | $450.00 | $0.00045 |
| GPT-5.4 | $500.00 | $750.00 | $1,250.00 | $0.00125 |
| Claude Sonnet 4.6 | $600.00 | $750.00 | $1,350.00 | $0.00135 |
| Claude Opus 4.6 | $1,000.00 | $1,250.00 | $2,250.00 | $0.00225 |
| GPT-5.4 Pro | $6,000.00 | $9,000.00 | $15,000.00 | $0.015 |
For simple classification at scale, the budget models are close to a rounding error. Mistral Small 3.2 processes 1 million requests for $21 — less than a team lunch.
Medium requests (800 input / 300 output tokens)
Customer support, summarization, and data extraction — the bread and butter of production AI:
| Model | Input Cost | Output Cost | Total (1M requests) | Cost per request |
|---|---|---|---|---|
| Mistral Small 3.2 | $48.00 | $54.00 | $102.00 | $0.000102 |
| Gemini 2.0 Flash-Lite | $56.00 | $90.00 | $146.00 | $0.000146 |
| GPT-5 nano | $40.00 | $120.00 | $160.00 | $0.00016 |
| Gemini 2.0 Flash | $80.00 | $120.00 | $200.00 | $0.0002 |
| GPT-4.1 nano | $80.00 | $120.00 | $200.00 | $0.0002 |
| Grok 4.1 Fast | $160.00 | $150.00 | $310.00 | $0.00031 |
| DeepSeek V3.2 | $224.00 | $126.00 | $350.00 | $0.00035 |
| GPT-5.4 nano | $160.00 | $375.00 | $535.00 | $0.000535 |
| GPT-5.4 mini | $600.00 | $1,350.00 | $1,950.00 | $0.00195 |
| Claude Haiku 4.5 | $800.00 | $1,500.00 | $2,300.00 | $0.0023 |
| Gemini 3.1 Pro | $1,600.00 | $3,600.00 | $5,200.00 | $0.0052 |
| GPT-5.4 | $2,000.00 | $4,500.00 | $6,500.00 | $0.0065 |
| Claude Sonnet 4.6 | $2,400.00 | $4,500.00 | $6,900.00 | $0.0069 |
| Claude Opus 4.6 | $4,000.00 | $7,500.00 | $11,500.00 | $0.0115 |
| GPT-5.4 Pro | $24,000.00 | $54,000.00 | $78,000.00 | $0.078 |
At medium complexity, the spread starts to matter. DeepSeek V3.2 at $350 versus Claude Opus 4.6 at $11,500 — that's a 33x cost difference for a million requests. The question is whether the quality gap justifies it.
📊 Quick Math: A SaaS product handling 10,000 customer support tickets per day hits 1M requests in about 100 days. At DeepSeek V3.2 rates, that's roughly $105/month. At Claude Opus 4.6, roughly $3,500/month. Manageable either way — but at 100K tickets/day, DeepSeek runs about $1,065/month while Opus climbs to nearly $35,000/month.
Complex requests (2,000 input / 1,000 output tokens)
Long-form generation, code analysis, and multi-step reasoning — where costs get serious:
| Model | Input Cost | Output Cost | Total (1M requests) | Cost per request |
|---|---|---|---|---|
| Mistral Small 3.2 | $120.00 | $180.00 | $300.00 | $0.0003 |
| Gemini 2.0 Flash-Lite | $140.00 | $300.00 | $440.00 | $0.00044 |
| GPT-5 nano | $100.00 | $400.00 | $500.00 | $0.0005 |
| Gemini 2.0 Flash | $200.00 | $400.00 | $600.00 | $0.0006 |
| Grok 4.1 Fast | $400.00 | $500.00 | $900.00 | $0.0009 |
| DeepSeek V3.2 | $560.00 | $420.00 | $980.00 | $0.00098 |
| Llama 4 Maverick | $540.00 | $850.00 | $1,390.00 | $0.00139 |
| GPT-5.4 nano | $400.00 | $1,250.00 | $1,650.00 | $0.00165 |
| GPT-5.4 mini | $1,500.00 | $4,500.00 | $6,000.00 | $0.006 |
| Claude Haiku 4.5 | $2,000.00 | $5,000.00 | $7,000.00 | $0.007 |
| Gemini 3.1 Pro | $4,000.00 | $12,000.00 | $16,000.00 | $0.016 |
| GPT-5.4 | $5,000.00 | $15,000.00 | $20,000.00 | $0.02 |
| Claude Sonnet 4.6 | $6,000.00 | $15,000.00 | $21,000.00 | $0.021 |
| Claude Opus 4.6 | $10,000.00 | $25,000.00 | $35,000.00 | $0.035 |
| o1 | $30,000.00 | $60,000.00 | $90,000.00 | $0.09 |
| GPT-5.2 pro | $42,000.00 | $168,000.00 | $210,000.00 | $0.21 |
| GPT-5.4 Pro | $60,000.00 | $180,000.00 | $240,000.00 | $0.24 |
⚠️ Warning: Reasoning models like GPT-5.4 Pro, o1, and GPT-5.2 pro generate internal "thinking" tokens that you pay for but don't see in the output. A request that appears to use 1,000 output tokens might actually consume 5,000-15,000 tokens internally. The costs above are base estimates — reasoning-heavy tasks can multiply these by 3-10x.
The real cost of reasoning models at scale
Reasoning models deserve their own section because they're the biggest cost trap in production AI.
Models like GPT-5.4 Pro ($30/$180 per million tokens), o3-pro ($20/$80), and GPT-5.2 pro ($21/$168) use chain-of-thought reasoning that generates hidden thinking tokens. These tokens count toward your bill but don't appear in the API response.
Here's what this means at scale:
A "1,000 output token" response from GPT-5.4 Pro might actually generate 8,000-12,000 total output tokens (thinking + visible). Instead of paying $0.18 in output cost per request, you're paying $1.44-$2.16. At 1 million requests, that's not $240,000 — it's roughly $1.5M to $2.2M.
| Model | Listed output price | Effective price (with reasoning) | 1M complex requests (actual) |
|---|---|---|---|
| GPT-5.4 Pro | $180/M | $540-$1,800/M | $600,000-$1,860,000 |
| o3-pro | $80/M | $240-$800/M | $280,000-$840,000 |
| GPT-5.2 pro | $168/M | $504-$1,680/M | $546,000-$1,722,000 |
| o3 | $8/M | $24-$80/M | $30,000-$84,000 |
| o4-mini | $4.40/M | $13.20-$44/M | $14,300-$45,100 |
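To budget this overhead yourself, it can be sketched as a multiplier on visible output tokens. The function name and the 8x figure are illustrative assumptions — calibrate the multiplier against your own usage logs, not this sketch.

```python
def reasoning_request_cost(input_rate, output_rate, input_tokens,
                           visible_output_tokens, thinking_multiplier):
    """USD per request; rates are $ per 1M tokens. Hidden thinking tokens
    are modeled as a multiplier on the visible output token count."""
    billable_output = visible_output_tokens * thinking_multiplier
    return (input_tokens * input_rate + billable_output * output_rate) / 1_000_000

# Complex profile (2,000 in / 1,000 visible out) on a $30/$180 model, 8x overhead:
print(round(reasoning_request_cost(30, 180, 2_000, 1_000, 8), 2))  # 1.5
```

At a 3x multiplier the same call returns $0.60 per request; at 1x it collapses to the base $0.24 — which is exactly why the listed rate alone is a misleading budgeting input.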
💡 Key Takeaway: Never budget for reasoning models using listed per-token rates. Multiply output costs by 3-10x for realistic estimates, depending on task complexity. Use our calculator to model different thinking token ratios.
The smartest approach: use reasoning models only where they measurably outperform standard models, and route everything else to cheaper alternatives. A model routing strategy can cut reasoning model spend by 70-80% while maintaining quality where it matters.
Annual cost projections: when scale gets real
One million requests is a milestone, not a ceiling. Many production systems process millions of requests per day. Here's what sustained high-volume usage looks like annually:
1M requests per day, 365 days (medium complexity)
| Model | Daily cost | Annual cost |
|---|---|---|
| Mistral Small 3.2 | $102 | $37,230 |
| Gemini 2.0 Flash-Lite | $146 | $53,290 |
| DeepSeek V3.2 | $350 | $127,750 |
| GPT-5.4 mini | $1,950 | $711,750 |
| Claude Haiku 4.5 | $2,300 | $839,500 |
| Gemini 3.1 Pro | $5,200 | $1,898,000 |
| GPT-5.4 | $6,500 | $2,372,500 |
| Claude Sonnet 4.6 | $6,900 | $2,518,500 |
| Claude Opus 4.6 | $11,500 | $4,197,500 |
| GPT-5.4 Pro | $78,000 | $28,470,000 |
[stat] $37,230 vs $28,470,000 Annual cost difference between Mistral Small 3.2 and GPT-5.4 Pro for the same 365M medium-complexity requests
At 10M requests per day — the scale of a mid-size SaaS platform — multiply everything by 10. Mistral Small 3.2 stays under $400K/year. GPT-5.4 Pro hits $284.7M/year. Claude Opus 4.6 lands at roughly $42M/year — still steep, but about one-seventh of GPT-5.4 Pro, which is what makes it the strongest value at the very top tier.
The five scaling traps that blow AI budgets
1. Output token blindness
Input tokens are cheap across the board. The budget killer is always output tokens. GPT-5.4's output rate ($15/M) is 6x its input rate ($2.50/M). Claude Opus 4.6 has a 5:1 output-to-input ratio.
The fix: Constrain output aggressively. Use structured JSON responses, set max_tokens limits, and instruct models to be concise. Cutting average output from 300 to 150 tokens saves 50% on your largest cost line.
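Using the $2.50/$15 per-million rates quoted in this guide for GPT-5.4, here's a quick sketch of what halving average output saves on the medium profile (function name is ours):

```python
def monthly_cost(requests, in_toks, out_toks, in_rate, out_rate):
    """USD for a batch of requests, given per-request tokens and $/1M-token rates."""
    return requests * (in_toks * in_rate + out_toks * out_rate) / 1_000_000

before = monthly_cost(1_000_000, 800, 300, 2.50, 15)  # uncapped output
after = monthly_cost(1_000_000, 800, 150, 2.50, 15)   # output capped at 150 tokens
print(before, after)  # 6500.0 4250.0
```

The output line drops 50% ($4,500 to $2,250) while input is unchanged, cutting the total by about 35% — the asymmetric rates are why output is the first lever to pull.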
2. Context window creep
It starts with "let's include the full conversation history." Then someone adds "and the user's profile." Then "and similar past tickets." Before you know it, every request sends 10,000 input tokens when 800 would do.
The fix: Implement context budgets. Summarize conversation history instead of including full transcripts. Use RAG with targeted retrieval instead of stuffing everything into the prompt.
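A minimal sketch of a context budget: keep the newest messages that fit under a token cap instead of shipping the whole history. The word-count tokenizer here is a crude stand-in — use your model's real tokenizer in practice.

```python
def apply_context_budget(messages, budget_tokens):
    """Return the most recent messages whose rough token total fits the budget,
    in chronological order."""
    kept, used = [], 0
    for msg in reversed(messages):     # walk newest-first
        tokens = len(msg.split())      # crude estimate; swap in a real tokenizer
        if used + tokens > budget_tokens:
            break                      # budget exhausted: drop everything older
        kept.append(msg)
        used += tokens
    return list(reversed(kept))        # restore chronological order

history = ["hello there", "how can I help", "my invoice is wrong", "which invoice number"]
print(apply_context_budget(history, 7))  # ['my invoice is wrong', 'which invoice number']
```

Pair this with a rolling summary of the dropped turns so the model keeps long-range context without paying for the full transcript on every call.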
3. Retry storms
API failures happen. Rate limits hit. The naive approach — retry immediately with exponential backoff — can double or triple your token spend when a provider has a bad hour.
The fix: Cache successful responses. Deduplicate identical requests. Use circuit breakers that fail fast instead of retrying expensive calls. Never retry a request that returned a valid (if suboptimal) response.
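A minimal sketch of the cache-plus-capped-retry pattern, assuming a hypothetical `call_model` callable standing in for your provider client:

```python
import hashlib

_cache = {}

def cached_call(prompt, call_model, max_retries=2):
    """Return a cached response for identical prompts; cap retries on failure."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in _cache:                  # deduplicate identical requests
        return _cache[key]
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            result = call_model(prompt)
            _cache[key] = result       # never re-pay for a valid response
            return result
        except Exception as exc:
            last_error = exc           # retry up to the cap, then fail fast
    raise last_error

calls = []
def fake_model(prompt):
    calls.append(prompt)               # track how often the "API" is hit
    return prompt.upper()

print(cached_call("classify this ticket", fake_model))  # CLASSIFY THIS TICKET
print(cached_call("classify this ticket", fake_model))  # cache hit, no second call
print(len(calls))  # 1
```

In production you'd add backoff between attempts and a TTL on cache entries, but the invariants are the same: identical work is billed once, and failures stop burning tokens after a fixed cap.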
4. Model over-specification
Using Claude Opus 4.6 for sentiment classification is like hiring a PhD to sort mail. It works, but Mistral Small 3.2 does the same job for about 1% of the cost.
The fix: Benchmark your actual tasks against cheaper models. Most classification, extraction, and routing tasks perform identically on models costing under $1/M output tokens. Reserve flagship and reasoning models for tasks where they demonstrably improve outcomes.
5. Ignoring provider volume discounts
At enterprise scale, listed pricing is a starting point. OpenAI, Anthropic, and Google all offer volume-based discounts, committed-use pricing, and custom enterprise agreements.
The fix: Once you're spending $5,000+/month with a single provider, contact their sales team. Discounts of 20-40% are common at scale. The OpenAI Batch API offers 50% discounts for non-time-sensitive workloads without any negotiation required.
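Batch and volume discounts compound. Here's a sketch with illustrative figures — the 60% batchable share and 20% volume discount are assumptions, and actual discount tiers vary by provider and contract:

```python
def discounted_monthly_spend(base_spend, batch_share, batch_discount, volume_discount):
    """Apply a batch discount to async traffic, then a volume discount to the total."""
    batched = base_spend * batch_share * (1 - batch_discount)    # async workloads
    realtime = base_spend * (1 - batch_share)                    # latency-sensitive traffic
    return (batched + realtime) * (1 - volume_discount)

# $10,000/month, 60% batchable, 50% batch discount, 20% negotiated volume discount:
print(round(discounted_monthly_spend(10_000, 0.6, 0.5, 0.2), 2))  # 5600.0
```

That's a 44% reduction before touching model choice or prompt design — which is why discounts belong on the checklist alongside technical optimizations.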
✅ TL;DR: The five cost killers at scale are output tokens, context bloat, retries, over-specified models, and ignoring volume discounts. Fix these before optimizing anything else.
The optimal scaling stack: a model for every tier
The most cost-effective production architecture isn't one model — it's a tiered system that routes requests to the cheapest model capable of handling them:
Tier 1 — Routing and classification (80% of requests): GPT-5.4 nano ($0.20/$1.25) or Mistral Small 3.2 ($0.06/$0.18). These models classify incoming requests and route them to the appropriate tier.
Tier 2 — Standard processing (15% of requests): GPT-5.4 mini ($0.75/$4.50), DeepSeek V3.2 ($0.28/$0.42), or Gemini 2.0 Flash ($0.10/$0.40). Handles most summarization, extraction, and simple generation tasks.
Tier 3 — Complex reasoning (5% of requests): Claude Sonnet 4.6 ($3/$15), GPT-5.4 ($2.50/$15), or Gemini 3.1 Pro ($2/$12). Reserved for tasks requiring nuanced understanding, long-form generation, or multi-step reasoning.
Tier 4 — Critical tasks only (<1% of requests): Claude Opus 4.6 ($5/$25) or GPT-5.4 Pro ($30/$180). Legal analysis, medical reasoning, complex code generation — tasks where accuracy is worth any price.
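In code, the routing layer can be as small as a lookup table. The model identifiers below are illustrative placeholders mirroring the tiers above, and the complexity label itself would come from the Tier 1 classifier:

```python
TIER_MODELS = {
    "simple": "mistral-small-3.2",    # Tier 1: routing and classification
    "standard": "deepseek-v3.2",      # Tier 2: summarization and extraction
    "complex": "claude-sonnet-4.6",   # Tier 3: nuanced reasoning
    "critical": "claude-opus-4.6",    # Tier 4: accuracy at any price
}

def route(complexity_label):
    """Map a complexity label (emitted by the Tier 1 classifier) to a model id.
    Unknown labels fall back to the standard tier rather than the priciest one."""
    return TIER_MODELS.get(complexity_label, TIER_MODELS["standard"])

print(route("simple"))    # mistral-small-3.2
print(route("unknown"))   # deepseek-v3.2
```

The fallback choice matters: defaulting unknown labels to a mid-tier model bounds the cost of classifier mistakes without silently degrading hard requests to the cheapest tier.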
What tiered routing saves
For 1M medium-complexity requests, here's a single model versus tiered routing (using GPT-5.4 nano, DeepSeek V3.2, Claude Sonnet 4.6, and Claude Opus 4.6 for Tiers 1-4):
| Approach | Cost |
|---|---|
| All requests → Claude Opus 4.6 | $11,500 |
| All requests → GPT-5.4 | $6,500 |
| Tiered (80/15/4/1 split) | $872 |
Tiered routing saves 87% compared to running everything on GPT-5.4 and 92% compared to Claude Opus 4.6 for the same workload. At 10M requests/day, that's the difference between $23.7M/year and $3.2M/year.
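A tiered blend is easy to sanity-check: weight each tier's per-request cost by its traffic share. Per-request costs here are derived from the per-million-token rates quoted in the tier descriptions above; the specific model mix is one possible choice, not the only one.

```python
def per_request(in_rate, out_rate, in_toks=800, out_toks=300):
    """USD per request from $-per-1M-token rates, medium profile by default."""
    return (in_toks * in_rate + out_toks * out_rate) / 1_000_000

def blended_cost_per_million(tiers):
    """tiers: (traffic_share, cost_per_request) pairs -> USD per 1M requests."""
    return sum(share * cost for share, cost in tiers) * 1_000_000

mix = [
    (0.80, per_request(0.20, 1.25)),   # Tier 1: GPT-5.4 nano
    (0.15, per_request(0.28, 0.42)),   # Tier 2: DeepSeek V3.2
    (0.04, per_request(3.00, 15.00)),  # Tier 3: Claude Sonnet 4.6
    (0.01, per_request(5.00, 25.00)),  # Tier 4: Claude Opus 4.6
]
print(round(blended_cost_per_million(mix), 1))  # ≈ 871.5
```

Swapping Tier 1 to Mistral Small 3.2 drops the blend further — the point is that the 80% bucket dominates, so optimize the cheapest tier first.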
Read our full guide on model routing strategies for implementation details.
Provider comparison: who wins at scale?
Each provider has distinct advantages at high volume:
Best for ultra-cheap processing: Mistral AI. Mistral Small 3.2 at $0.06/$0.18 is the cheapest capable model on the market. For classification and simple tasks at massive scale, nothing touches it.
Best flagship value: Google Gemini. Gemini 2.0 Flash at $0.10/$0.40 punches well above its price class, and Gemini 3.1 Pro at $2/$12 undercuts both GPT-5.4 and Claude Sonnet while offering competitive quality.
Best reasoning value: DeepSeek. V3.2 at $0.28/$0.42 offers reasoning-capable performance at budget prices. R1 V3.2 at the same price adds explicit chain-of-thought. For teams that need reasoning without the premium reasoning model tax, DeepSeek is the answer.
Best quality ceiling: Anthropic. Claude Opus 4.6 at $5/$25 delivers top-tier quality at roughly one-sixth the cost of GPT-5.4 Pro ($30/$180). For tasks where you need the absolute best output and can't compromise, Opus 4.6 gives you flagship quality without flagship bankruptcy.
Best ecosystem: OpenAI. The GPT-5.4 family spans from nano ($0.20/$1.25) to Pro ($30/$180), giving you a single provider for every tier. Add the Batch API for 50% off async workloads, and OpenAI becomes the easiest to scale with — even if individual models aren't always the cheapest.
Best open-source option: Meta Llama 4 Maverick via Together AI at $0.27/$0.85, or self-hosted for maximum cost control at high volumes.
📊 Quick Math: If you're spending $10,000/month on Claude Opus 4.6, switching the bottom 80% of requests to Claude Haiku 4.5 ($1/$5) saves $6,400/month — $76,800/year — while keeping Opus for the tasks that actually need it.
When to self-host vs use APIs
At very high scale, self-hosting open-source models (Llama 4, Mistral, DeepSeek) on your own GPUs becomes cost-competitive with API pricing. The crossover point depends on your volume:
Stick with APIs when:
- Processing under 10M requests/day
- Traffic is highly variable (seasonal spikes, unpredictable growth)
- You don't have ML infrastructure expertise in-house
- You need multiple model providers for redundancy
Consider self-hosting when:
- Processing 10M+ requests/day consistently
- You need data sovereignty (no tokens leaving your infrastructure)
- Your workload is predictable enough to provision GPUs efficiently
- You have ML ops capability to manage model serving
The break-even for a single A100 GPU running Llama 4 Maverick is roughly 5M requests/day for medium-complexity tasks — below that, APIs are cheaper when you factor in infrastructure management, monitoring, and scaling overhead.
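As a back-of-envelope check, the break-even is a fixed daily serving cost divided by the per-request API price. Both figures below are illustrative assumptions (GPU amortization plus ops time versus a cheap-tier API rate), not benchmarks:

```python
def breakeven_requests_per_day(serving_cost_per_day, api_cost_per_request):
    """Requests/day above which a fixed-cost self-hosted GPU beats
    per-request API pricing. Ignores utilization gaps and spike headroom."""
    return serving_cost_per_day / api_cost_per_request

# Assumed $500/day all-in serving cost vs a $0.0001/request API rate:
print(round(breakeven_requests_per_day(500, 0.0001)))  # 5000000
```

Note the sensitivity: halve the API rate and the break-even doubles, which is why falling API prices keep pushing the self-hosting threshold upward.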
Frequently asked questions
How much does 1 million AI API requests cost?
Between roughly $20 and $240,000+ depending on the model and task complexity. Simple classification with Mistral Small 3.2 costs about $21. Complex generation with GPT-5.4 Pro runs into the hundreds of thousands of dollars — and potentially millions when you account for reasoning tokens. Use our calculator to get exact numbers for your specific use case.
What's the cheapest AI model for high-volume production use?
Mistral Small 3.2 at $0.06 input / $0.18 output per million tokens is the cheapest capable model for most tasks. For even lower costs, Gemini 2.0 Flash-Lite at $0.07/$0.30 and GPT-5 nano at $0.05/$0.40 are strong alternatives. The right choice depends on your quality requirements — benchmark against your actual tasks before committing.
Do AI API providers offer volume discounts?
Yes. OpenAI, Anthropic, and Google all offer enterprise pricing with volume discounts, typically starting at $5,000-$10,000/month in spend. OpenAI's Batch API provides an automatic 50% discount for asynchronous workloads. Custom committed-use agreements can reduce per-token costs by 20-40% for predictable high-volume workloads.
How do reasoning model costs differ from standard models at scale?
Reasoning models (GPT-5.4 Pro, o3, o3-pro) generate hidden "thinking" tokens that multiply your actual costs by 3-10x beyond the listed per-token rate. A task that appears to cost $0.18 per request might actually cost $1.44-$2.16. Always budget 5x the listed output cost for reasoning models, and reserve them for tasks where chain-of-thought reasoning demonstrably improves results.
Should I use one AI model or multiple models in production?
Multiple models, routed by task complexity. A tiered approach using cheap models for simple tasks (80% of volume) and premium models for complex tasks (5% of volume) typically saves 80-90% compared to using a single flagship model for everything. See our model routing guide for implementation strategies.
Start calculating your actual costs
Every number in this guide came from real pricing data in our AI Cost Calculator. Plug in your expected request volume, average token counts, and target model to get exact monthly and annual projections.
The biggest mistake teams make at scale isn't choosing the wrong model — it's not doing the math at all. A 10-minute cost analysis before committing to a provider can save tens of thousands of dollars over a year.