The question comes up constantly in developer communities: should you self-host a model with Ollama or vLLM, or just pay for a cloud API? Both options have obvious champions and obvious blind spots. The honest answer is: it depends on your volume, but you probably need more volume than you think before local wins, especially if you benchmark against options like Llama 4 Maverick.
Let's do the actual math.
[stat] 50M+ tokens/month The break-even point where self-hosting starts beating cloud APIs for mid-tier models like Claude Sonnet 4.5
The local setup: what it actually costs
Running models locally isn't free. You're trading API bills for hardware, electricity, maintenance, and the opportunity cost of your time.
Hardware
The GPU is the dominant cost. Here's what you need for common open-source models in 2026:
| Model | VRAM Required | Hardware Option | Cost |
|---|---|---|---|
| Llama 3.1 8B / Mistral 7B | 8GB | RTX 4060 Ti | ~$400 |
| Llama 3.1 70B (q4) | 40GB | RTX 4090 (24GB) + offload | ~$2,000 |
| Llama 3.3 70B (full) | 80GB | A100 80GB | ~$10,000 |
| Llama 3.1 405B (q4) | 200GB+ | Multi-GPU setup | $30,000+ |
Quantization matters. Running Llama 70B at 4-bit quantization (q4) cuts VRAM requirements to roughly a quarter of full FP16 (from ~140GB down to ~40GB) with minimal quality loss. Most practical self-hosters run quantized models.
For the "reasonable home lab" setup: an RTX 4090 at ~$1,800-2,200 handles 7B-13B models at full precision, or 70B at heavy quantization. This is the most popular entry point.
For production-grade self-hosting: you're looking at A100 80GB territory ($10,000-15,000 used), or renting bare-metal GPU servers.
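A quick back-of-envelope formula ties the table above together: weight memory is roughly parameter count times bytes per weight, plus headroom for the KV cache and activations. This sketch (the 20% overhead factor is an assumption; real requirements vary with context length and serving stack) reproduces the ~40GB q4 figure:

```python
def vram_gb(params_b: float, bits: int, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weights at the given bit width, plus ~20%
    headroom for KV cache and activations. A coarse rule of thumb only."""
    return params_b * (bits / 8) * overhead

# Llama 70B at FP16 vs 4-bit quantization
print(round(vram_gb(70, 16)))  # ~168 GB: multi-GPU / A100-class territory
print(round(vram_gb(70, 4)))   # ~42 GB: matches the ~40GB row above
```

Plug in your own model size and bit width before committing to hardware; the overhead factor is the least certain part of the estimate.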
Electricity
This one surprises people. GPUs draw serious wattage.
- RTX 4090: 350W at load
- A100 80GB: 400W at load
At the US average electricity rate of $0.12/kWh:
- RTX 4090 running 24/7: $30/month
- RTX 4090 at 30% utilization: ~$9/month
- A100 running 24/7: $35/month
- Rented GPU cloud instance (A100, Lambda Labs): ~$1.30/hour = ~$950/month at 24/7
📊 Quick Math: A rented A100 at $950/month costs more than most cloud API bills under 100M tokens/month. Only buy dedicated GPU time if your volume justifies it.
Electricity is cheap if the hardware is already yours. Rented GPU cloud is a different story — you're effectively paying cloud API prices anyway, with more operational overhead.
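The electricity figures above come from simple wattage math. A small helper (assuming a 720-hour month and the $0.12/kWh US average) reproduces them:

```python
def monthly_power_cost(watts: float, rate_per_kwh: float = 0.12,
                       utilization: float = 1.0, hours: float = 720) -> float:
    """Monthly electricity cost for a GPU at a given average utilization."""
    kwh = (watts / 1000) * hours * utilization
    return kwh * rate_per_kwh

print(round(monthly_power_cost(350)))                   # RTX 4090, 24/7: ~$30
print(round(monthly_power_cost(350, utilization=0.3)))  # 30% duty cycle: ~$9
print(round(monthly_power_cost(400)))                   # A100 80GB, 24/7: ~$35
```

Swap in your local electricity rate; at European rates of $0.30+/kWh the numbers roughly triple, which can shift the break-even math.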
Hidden costs
- Setup time: Getting Ollama or vLLM configured, tuned for inference speed, and integrated into your stack takes hours or days. That's real cost.
- Maintenance: Model updates, quantization choices, prompt caching, batching — all manual. Cloud providers handle this automatically.
- Inference speed: A single RTX 4090 generates roughly 30-50 tokens/second for a 7B model. Cloud APIs with optimized serving infrastructure often hit 100-300+ tokens/second.
- Reliability: Your local machine goes down. Cloud APIs have SLAs.
⚠️ Warning: Don't forget to factor in your engineering time. If you spend 20 hours setting up and maintaining a local inference stack at $100/hour equivalent, that's $2,000 in opportunity cost before you save a single dollar on tokens.
Cloud API costs in 2026
For comparison, here's what the cloud alternatives actually cost:
| Model | Input $/1M | Output $/1M | Notes |
|---|---|---|---|
| GPT-5 nano | $0.05 | $0.40 | OpenAI's cheapest |
| Gemini 2.0 Flash | $0.10 | $0.40 | Fast, cheap |
| DeepSeek V3.2 | $0.28 | $0.42 | Best value mid-tier |
| Gemini 2.5 Flash | $0.15 | $0.60 | Better quality |
| GPT-5 Mini | $0.25 | $2.00 | OpenAI mid-tier |
| GPT-5.2 | $1.75 | $14.00 | OpenAI flagship |
| Claude Sonnet 4.5 | $3.00 | $15.00 | Anthropic mid-tier |
| Claude Opus 4.6 | $5.00 | $25.00 | Anthropic flagship |
The key insight: budget cloud models like GPT-5 nano and Gemini 2.0 Flash are so cheap that self-hosting rarely beats them on cost, which is why teams often start with the cheapest AI APIs. The break-even case is strongest against premium models like Claude Sonnet and Opus.
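All of the break-even scenarios below boil down to one formula: monthly bill equals input millions times input price plus output millions times output price. A two-line helper makes the table easy to apply to your own volumes:

```python
def api_cost(input_m: float, output_m: float,
             in_price: float, out_price: float) -> float:
    """Monthly API bill given token volumes (in millions) and $/1M prices."""
    return input_m * in_price + output_m * out_price

# Same 10M in / 5M out workload, two different models:
print(api_cost(10, 5, 3.00, 15.00))  # Claude Sonnet 4.5: 105.0
print(api_cost(10, 5, 0.05, 0.40))   # GPT-5 nano: 2.5
```

The 42x spread between those two bills for an identical workload is the whole story of this article: which model you are replacing matters far more than the token count.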
The break-even analysis
This is where it gets concrete. Let's compare three realistic scenarios.
Scenario 1: Replacing GPT-5 nano with local Llama 8B
You're running a simple chatbot or classification task. GPT-5 nano at $0.05/$0.40 is already cheap. Llama 3.1 8B is its open-source equivalent.
Local setup: RTX 4060 Ti ($400) + $5/month electricity
| Monthly Tokens | GPT-5 nano Cost | Local Cost (electricity only) | Break-Even Point |
|---|---|---|---|
| 1M input + 200K output | $0.13 | $5.00 | Never (local costs more) |
| 10M input + 2M output | $1.30 | $5.00 | Never |
| 100M input + 20M output | $13.00 | $5.00 | ~50 months for hardware |
| 1B input + 200M output | $130.00 | $5.00 | ~3 months for hardware |
Verdict: At GPT-5 nano prices, you need extremely high volume before local pays off. The model is simply too cheap.
Scenario 2: Replacing Claude Sonnet 4.5 with local Llama 70B
Now we're talking. Claude Sonnet 4.5 at $3.00/$15.00 is where self-hosting starts making financial sense.
Local setup: RTX 4090 ($2,000) running quantized Llama 70B + $10/month electricity at 30% utilization
| Monthly Tokens | Claude Sonnet Cost | Local Cost | Break-Even |
|---|---|---|---|
| 1M input + 500K output | $10.50 | $10 | Effectively never (~$0.50/month savings) |
| 10M input + 5M output | $105 | $10 | ~20 months |
| 50M input + 25M output | $525 | $10 | ~4 months |
| 100M input + 50M output | $1,050 | $10 | ~2 months |
💡 Key Takeaway: Replacing Claude Sonnet 4.5 with a local 70B model becomes financially compelling at 50M+ tokens/month. At 100M tokens/month, your $2,000 hardware investment pays for itself in under 2 months.
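The break-even column above is just hardware cost divided by monthly savings. A small helper (numbers from Scenario 2) lets you plug in your own figures:

```python
def break_even_months(hardware: float, api_monthly: float,
                      local_monthly: float) -> float:
    """Months until hardware cost is recovered by monthly API savings.
    Returns infinity when local running costs exceed the API bill."""
    savings = api_monthly - local_monthly
    return float("inf") if savings <= 0 else hardware / savings

# $2,000 RTX 4090 vs Claude Sonnet 4.5 at 10M in / 5M out ($105/month)
print(round(break_even_months(2000, 105, 10)))      # ~21 months
# At 100M in / 50M out ($1,050/month), it collapses to under 2 months
print(round(break_even_months(2000, 1050, 10), 1))  # ~1.9 months
```

Note that the function ignores engineering time; add your setup-hour estimate to the hardware figure for an honest comparison.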
Scenario 3: Production scale with rented GPU
If you need reliability, 24/7 availability, and burst capacity — you'll likely rent GPU compute rather than buy hardware.
Lambda Labs A100 (80GB): $1.30/hour = **$950/month**
| Monthly Tokens | Claude Sonnet Cost | Rented A100 Cost | Winner |
|---|---|---|---|
| 10M/5M | $105 | $950 | Cloud API |
| 100M/50M | $1,050 | $950 | Rented GPU |
| 500M/250M | $5,250 | $950 | Rented GPU (5.5x cheaper) |
Rented GPU infrastructure beats Claude Sonnet at roughly 90M input tokens/month (about 135M combined, assuming a 2:1 input-to-output split). At 500M/250M tokens/month, you're saving $4,300 per month, and you can sanity-check that estimate against current cost-per-million token benchmarks.
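That 90M figure falls out of a blended-rate calculation: at a 2:1 input-to-output ratio, each million input tokens carries half a million output tokens. A sketch under that assumption:

```python
def break_even_million_tokens(gpu_monthly: float, in_price: float,
                              out_price: float, out_ratio: float = 0.5) -> float:
    """Million input tokens/month at which a rented GPU matches the API bill,
    assuming output volume is a fixed fraction of input (default 1:2)."""
    blended = in_price + out_ratio * out_price  # $ per 1M input tokens
    return gpu_monthly / blended

# Claude Sonnet 4.5 ($3/$15) vs a $950/month rented A100
print(round(break_even_million_tokens(950, 3.00, 15.00)))  # ~90M input tokens
```

Adjust `out_ratio` for your workload: chat apps often run output-heavy, which pushes the break-even point lower.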
When local actually makes sense
After running the numbers, here's when self-hosting wins:
Go local when:
- You're processing 50M+ tokens/month on mid-tier or higher models
- You have data privacy requirements that prohibit sending data to third parties
- Your workload is batch-friendly (no latency requirements)
- You already have the hardware (marginal electricity cost changes the math dramatically)
- You're running a fine-tuned model that doesn't exist in any cloud API
Stay cloud when:
- Volume is low or unpredictable
- You need burst capacity (local hardware maxes out)
- Latency matters (cloud APIs are typically faster)
- You're using models in the sub-$1/M tier where self-hosting ROI is poor
- Reliability and uptime are non-negotiable
For help estimating your cloud costs accurately, use our AI Cost Calculator — then compare against the hardware numbers above.
Practical middle ground: Ollama for dev, cloud for prod
The most common pattern for cost-conscious developers in 2026: run Ollama locally during development (zero API costs), deploy with cloud APIs in production where reliability and speed matter.
This gets you:
- Free experimentation and iteration
- No API bills while building
- Production-grade reliability when it counts
- Flexibility to switch models without hardware investment
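One way to implement this pattern with minimal code churn: Ollama exposes an OpenAI-compatible API at `http://localhost:11434/v1`, so the same client code can target local or cloud depending on an environment variable. A sketch (the prod model id, env-var names, and endpoint choice here are illustrative, not prescriptive):

```python
import os

def client_config(env: str) -> dict:
    """Pick an OpenAI-compatible endpoint per environment. Only the base
    URL, key, and model name change between dev and prod."""
    if env == "dev":
        return {"base_url": "http://localhost:11434/v1",  # local Ollama
                "api_key": "ollama",   # Ollama ignores the key, but the
                                       # client libraries require one
                "model": "llama3.1"}
    return {"base_url": "https://api.openai.com/v1",
            "api_key": os.environ.get("OPENAI_API_KEY", ""),
            "model": "gpt-5-mini"}     # illustrative prod model id

cfg = client_config(os.environ.get("APP_ENV", "dev"))
print(cfg["base_url"])  # dev default: the local Ollama endpoint
```

Because both endpoints speak the same wire format, switching environments is a config change rather than a code change, which is exactly what makes the dev-local / prod-cloud split cheap to maintain.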
If your production volume eventually justifies self-hosting, you make that switch with real usage data — not guesses. Track your actual costs with our cost optimization strategies guide first.
A note on model quality gaps
The math above assumes local models are comparable to cloud equivalents. They're close, but not identical.
- Llama 3.1 8B ≈ GPT-4o mini in most benchmarks — solid for classification, summarization, simple Q&A once you understand how AI tokens map to text volume
- Llama 3.3 70B (quantized) ≈ GPT-5.1 / Claude Sonnet 4.5 — genuinely competitive for most tasks
- Llama 4 Maverick — available as API at $0.27/$0.85, often a better deal than self-hosting Llama 405B
- Llama 405B ≈ GPT-5.2 — but the hardware cost to run it approaches cloud pricing anyway
For most general-purpose tasks, the quality gap at 70B is small enough to justify self-hosting if your volume is there. For complex reasoning, coding, or tasks where Claude/GPT-5 quality matters, the gap may be worth paying for. Compare models head-to-head with our model comparison tool.
📊 Quick Math: Running Llama 4 Maverick via API at $0.27/$0.85 per 1M tokens costs just $1.12 for 1M in plus 1M out, undercutting GPT-5 Mini ($2.25) and every flagship in the table above, with mid-tier quality.
Batch processing: the cloud middle ground
If your workload isn't latency-sensitive, OpenAI's Batch API offers 50% off standard pricing for 24-hour turnaround. That means GPT-5.2 drops to $0.875/$7.00 per million tokens — significantly closing the gap with self-hosting and eliminating hardware risk entirely. Before investing in local infrastructure, check whether batch pricing solves your cost problem.
The cost of doing nothing
Many teams delay the local vs cloud decision and stick with whatever they started with. If you're spending $5,000+/month on Claude Sonnet or GPT-5.2 API calls, even a rough break-even analysis could reveal massive savings.
A $2,000 GPU investment that saves $1,000/month in API costs pays for itself in 2 months. But only if your volume is there. Run your numbers through our calculator first.
TL;DR decision table
| Situation | Recommendation |
|---|---|
| <10M tokens/month | Cloud API (always) |
| 10-50M tokens/month, cheap models | Cloud API |
| 10-50M tokens/month, Sonnet+ tier | Evaluate local |
| >50M tokens/month, Sonnet+ tier | Self-host or rent GPU |
| Privacy requirements | Self-host regardless of volume |
| Dev/prototyping | Ollama local, free |
✅ TL;DR: The tipping point for most developers landing on premium models (Claude Sonnet, GPT-5.2) is around 50-100M tokens/month. Below that, cloud APIs win on simplicity. Above it, the hardware pays for itself fast.
Frequently asked questions
At what volume does self-hosting AI become cheaper than cloud APIs?
For premium models like Claude Sonnet 4.5 ($3.00/$15.00), self-hosting breaks even at approximately 50M tokens/month with consumer GPU hardware (~$2,000 investment). For budget models like GPT-5 nano ($0.05/$0.40), you'd need over 1 billion tokens/month before self-hosting makes financial sense. The more expensive the cloud model you're replacing, the faster self-hosting pays off.
What hardware do I need to run AI models locally?
For small models (7-8B parameters): an RTX 4060 Ti ($400) with 8GB VRAM is sufficient. For mid-size models (70B quantized): an RTX 4090 ($2,000) with 24GB VRAM. For large models (405B+): you need multi-GPU setups starting at $30,000+. Most developers start with an RTX 4090, which handles quantized 70B models well.
Is Ollama good enough for production use?
Ollama is excellent for development and small-scale production. For high-throughput production workloads, vLLM or TGI (Text Generation Inference) offer better performance with features like continuous batching and optimized inference. The trade-off is more complex setup. For production at scale, rented GPU instances with vLLM are the standard approach.
How much electricity does running AI locally cost?
An RTX 4090 at full load draws 350W, costing about $30/month at US average electricity rates ($0.12/kWh) if running 24/7. At realistic 30% utilization, it's closer to $9/month. An A100 draws 400W, costing about $35/month. Electricity is rarely the deciding factor — hardware and engineering time dominate the cost equation.
Can I mix local and cloud AI in the same application?
Yes, and it's a smart strategy. Route high-volume, simple tasks to local models (free after hardware cost) and use cloud APIs for complex queries where quality matters. For example: use local Llama 8B for classification and extraction, but call GPT-5.2 or Claude Opus for nuanced reasoning. This hybrid approach minimizes costs while maintaining quality where it counts.
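As a sketch, that routing logic can be as simple as a task-type lookup. Everything here (the task labels, model names, and length cutoff) is illustrative, not a real API:

```python
def route(task: str, prompt: str) -> str:
    """Toy router: send cheap, high-volume task types to the local model
    and reserve the cloud API for open-ended reasoning."""
    local_tasks = {"classify", "extract", "summarize"}
    if task in local_tasks and len(prompt) < 8000:  # long docs go to cloud
        return "local:llama-8b"
    return "cloud:claude-opus"

print(route("classify", "Is this email spam?"))      # local:llama-8b
print(route("reason", "Plan a migration strategy"))  # cloud:claude-opus
```

Real deployments usually add a fallback (retry on the cloud model when the local answer fails validation), which keeps quality high while most volume stays free.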
