Retrieval-Augmented Generation (RAG) is the most popular way to build AI applications that work with your own data. But unlike a simple chatbot, RAG has multiple cost components — embeddings, storage, retrieval, and generation — that add up fast at scale. Most teams budget only for the LLM call and get blindsided by the rest of the bill.
Here's exactly what RAG costs in 2026 across every major provider, broken down by every layer of the pipeline, with real numbers you can use to budget your next project.
The three cost layers of RAG
Every RAG pipeline has three billable stages:
- Embedding — Converting your documents into vectors (one-time + updates)
- Retrieval — Searching your vector database (infrastructure cost, not API)
- Generation — Sending retrieved context + query to an LLM for answers
Most cost discussions focus only on generation. That's a mistake — you need the full picture to budget accurately. Let's break down all three.
💡 Key Takeaway: Generation typically accounts for 80–95% of total RAG costs. But ignoring embedding and infrastructure costs can throw off your budget by 10–20% at scale.
Layer 1: Embedding costs
You pay embedding costs when you ingest documents. This is mostly a one-time cost, plus incremental updates as your knowledge base grows.
| Model | Price per 1M tokens | Dimensions |
|---|---|---|
| OpenAI text-embedding-3-small | $0.02 | 1,536 |
| OpenAI text-embedding-3-large | $0.13 | 3,072 |
| Google text-embedding-005 | $0.00 (free tier) | 768 |
| Mistral embed | $0.10 | 1,024 |
| Cohere embed-v4 | $0.10 | 1,024 |
Real-world example: Embedding 10,000 documents (averaging 2,000 tokens each) = 20M tokens.
- OpenAI small: $0.40 (one-time)
- OpenAI large: $2.60 (one-time)
- Google: Free (within limits)
Embedding costs are almost negligible. Even at 1 million documents, you're looking at $40–260 with OpenAI. This is not where your budget goes.
📊 Quick Math: 1 million documents × 2,000 tokens average = 2 billion tokens. At OpenAI's small embedding rate ($0.02/M), that's $40 total to embed your entire knowledge base.
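That quick math generalizes to a few lines. A minimal sketch in Python, with per-1M-token rates hardcoded from the table above (verify them against current pricing before budgeting):

```python
# One-time embedding cost estimator. Prices are $ per 1M tokens,
# taken from the embedding table above; refresh before relying on them.
EMBED_PRICE_PER_M = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "mistral-embed": 0.10,
}

def embedding_cost(num_docs: int, avg_tokens_per_doc: int, model: str) -> float:
    """Total dollars to embed a corpus once."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * EMBED_PRICE_PER_M[model]

# 1M documents at 2,000 tokens each with the small model: $40 one-time
print(embedding_cost(1_000_000, 2_000, "text-embedding-3-small"))
```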
Layer 2: Vector database costs
This is infrastructure, not API — but it's a real RAG cost people forget:
| Solution | Monthly Cost (1M vectors) |
|---|---|
| Pinecone Starter | Free (up to 100K) |
| Pinecone Standard | ~$70/month |
| Qdrant Cloud | ~$30/month |
| Weaviate Cloud | ~$25/month |
| pgvector (self-hosted) | Your server cost |
| Chroma (self-hosted) | Free (your server) |
For most startups, a self-hosted pgvector on a $20/month VPS handles millions of vectors easily. Don't overpay for managed vector databases until you actually need the scale features — most RAG apps under 5M vectors run fine on a single Postgres instance.
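To see why a plain Postgres instance goes so far: vector retrieval is, at its core, a similarity scan plus an index. A pure-Python sketch of brute-force cosine top-k (pgvector expresses the same idea in SQL, with operators like `<=>` for cosine distance and optional ANN indexes to skip the full scan):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Indices of the k most similar stored vectors (brute-force scan)."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine_sim(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

An index exists precisely to avoid this full scan at scale, but the data model is simple enough that a single database handles it comfortably.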
Layer 3: Generation costs (where the money goes)
This is 80–95% of your RAG spend. Every query sends retrieved chunks + the user's question to an LLM. The key variable: how much context you stuff in.
A typical RAG query looks like:
- System prompt: ~200 tokens
- Retrieved chunks: 3–8 chunks × 500 tokens = 1,500–4,000 tokens
- User query: ~50 tokens
- Total input: ~2,000–4,500 tokens per query
- Output: ~300–800 tokens
Let's calculate monthly costs for 10,000 RAG queries/month with average 3,000 input + 500 output tokens per query:
| Model | Input Cost | Output Cost | Total/Month |
|---|---|---|---|
| Gemini 2.5 Flash | $0.75 | $2.00 | $2.75 |
| GPT-5 nano | $1.50 | $2.00 | $3.50 |
| Mistral Small 3.2 | $3.00 | $1.50 | $4.50 |
| GPT-5 mini | $3.00 | $6.00 | $9.00 |
| DeepSeek V3.2 | $8.40 | $2.10 | $10.50 |
| Mistral Large 3 | $15.00 | $7.50 | $22.50 |
| GPT-5 | $37.50 | $50.00 | $87.50 |
| Claude Sonnet 4.5 | $90.00 | $75.00 | $165.00 |
The winner for most RAG use cases: Gemini 2.5 Flash at $2.75/month for 10K queries. GPT-5 nano is close at $3.50. Both deliver surprisingly good retrieval-based Q&A despite being budget models.
Scaling up: 100K and 1M queries/month
Here's where model choice becomes your single most important financial decision:
100,000 queries/month
| Model | Monthly Cost |
|---|---|
| Gemini 2.5 Flash | $27.50 |
| GPT-5 nano | $35.00 |
| Mistral Small 3.2 | $45.00 |
| GPT-5 mini | $90.00 |
| GPT-5 | $875.00 |
| Claude Sonnet 4.5 | $1,650.00 |
1,000,000 queries/month
| Model | Monthly Cost |
|---|---|
| Gemini 2.5 Flash | $275 |
| GPT-5 nano | $350 |
| Mistral Small 3.2 | $450 |
| GPT-5 | $8,750 |
| Claude Sonnet 4.5 | $16,500 |
📊 Stat: $16,225/month is the cost difference between Gemini 2.5 Flash and Claude Sonnet 4.5 at 1M RAG queries, nearly $195K per year.
Model choice is your biggest cost lever. Everything else — chunk optimization, caching, reranking — is optimization at the margins compared to picking the right generation model.
The RAG cost formula
Here's the formula to estimate your monthly RAG generation cost:
Monthly Cost = Queries × [(Input Tokens × Input Price) + (Output Tokens × Output Price)] / 1,000,000, where prices are in dollars per 1M tokens and token counts are per-query averages.
Variables you control:
- Number of retrieved chunks — fewer chunks = less input cost (but potentially worse answers)
- Chunk size — smaller chunks reduce token count but may lose context
- Output length — constrain with max_tokens if you need concise answers
- Model choice — the single biggest lever
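As code, the formula and its levers look like this. A sketch where the per-1M-token rates are the ones implied by the Gemini 2.5 Flash row in the table above (treat them as illustrative inputs, not quoted prices):

```python
def rag_monthly_cost(queries: int, input_tokens: int, output_tokens: int,
                     input_price: float, output_price: float) -> float:
    """Monthly generation cost. Prices are $ per 1M tokens;
    token counts are per-query averages."""
    per_query_cost = (input_tokens * input_price
                      + output_tokens * output_price) / 1_000_000
    return queries * per_query_cost

# 10K queries/month, 3,000 input + 500 output tokens per query,
# at $0.025/M input and $0.40/M output: roughly $2.75/month
print(rag_monthly_cost(10_000, 3_000, 500, 0.025, 0.40))
```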
Use our AI Cost Calculator to run these numbers for your specific workload without doing the math by hand.
7 ways to reduce RAG costs
1. Use a cheaper model for simple queries
Route easy questions (FAQs, lookups) to GPT-5 nano or Gemini 2.5 Flash. Save flagship models for complex reasoning queries. A simple complexity router can cut costs 60–80%. We cover this in more detail in our 10 strategies to cut your AI API bill.
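A complexity router doesn't need to be clever to pay for itself. A hypothetical sketch (the model IDs, word-count threshold, and hint list are all placeholder assumptions to tune on your own traffic):

```python
CHEAP_MODEL = "gemini-2.5-flash"   # assumed model IDs; substitute your own
PREMIUM_MODEL = "gpt-5"

# Signals that a query likely needs multi-step reasoning.
COMPLEX_HINTS = ("compare", "why", "analyze", "explain the difference", "trade-off")

def route(query: str) -> str:
    """Send long or reasoning-flavored queries to the premium model."""
    q = query.lower()
    if len(q.split()) > 30 or any(hint in q for hint in COMPLEX_HINTS):
        return PREMIUM_MODEL
    return CHEAP_MODEL
```

In production you'd log routing decisions and spot-check the cheap model's answers before trusting the split.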
2. Optimize chunk count
Most RAG pipelines retrieve 5–10 chunks by default. Test with 3–4 chunks — often the top results contain 90% of the relevant information. Going from 8 to 3 chunks saves ~50% on input tokens.
3. Compress retrieved context
Use a reranker (like Cohere Rerank) to filter chunks before sending to the LLM. Spend $0.001 on reranking to save $0.01 on generation tokens. That's a 10× ROI on every query.
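That 10× figure is straightforward arithmetic. A sketch with illustrative numbers (the per-query rerank cost, tokens dropped, and input token rate are assumptions):

```python
def rerank_roi(rerank_cost: float, tokens_saved: int, input_price_per_m: float) -> float:
    """Generation dollars saved per dollar spent on reranking."""
    generation_saved = tokens_saved / 1_000_000 * input_price_per_m
    return generation_saved / rerank_cost

# Dropping 5 chunks x 500 tokens against a $4/M-token input rate,
# for $0.001 of reranking: ~10x return per query
print(rerank_roi(0.001, 2_500, 4.0))
```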
4. Cache common queries
If 20% of queries are repeated (common in support bots), cache the responses. A simple semantic cache can reduce API calls by 15–30%. This is especially effective for internal knowledge bases where employees ask similar questions.
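A minimal semantic cache embeds the incoming query and reuses a stored answer when a past query is similar enough. A sketch over raw embedding vectors (the similarity threshold is an assumption to tune; a real deployment would back this with a vector index rather than a linear scan):

```python
import math

class SemanticCache:
    """Linear-scan semantic cache; swap in a vector index at scale."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    @staticmethod
    def _sim(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a))
                      * math.sqrt(sum(x * x for x in b)))

    def get(self, query_emb):
        """Return a cached answer if a similar query was seen, else None."""
        for emb, answer in self.entries:
            if self._sim(query_emb, emb) >= self.threshold:
                return answer  # cache hit: no LLM call needed
        return None

    def put(self, query_emb, answer):
        self.entries.append((query_emb, answer))
```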
5. Use prompt caching
Anthropic and OpenAI both offer prompt caching that reduces input costs by 50–90% for repeated prefixes. If your system prompt + instructions are constant across requests (they usually are in RAG), you only pay full price once. This alone can halve your generation costs.
⚠️ Warning: Prompt caching has minimum token thresholds. Anthropic requires at least 1,024 tokens in the cached prefix, and OpenAI's automatic caching likewise only kicks in at 1,024 tokens (cached in 128-token increments). If your system prompt is shorter, pad it with instructions or examples to hit the threshold; the savings are worth it.
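On Anthropic's Messages API, caching is opt-in: you mark the static prefix with a `cache_control` block (OpenAI's caching is automatic once the prefix is long enough). A sketch of the request body; the model ID is an assumption:

```python
# Static RAG instructions go in the system block marked for caching;
# only the retrieved chunks and user question vary per request.
def build_request(system_prompt: str, context: str, question: str) -> dict:
    return {
        "model": "claude-sonnet-4-5",           # assumed model ID
        "max_tokens": 800,
        "system": [{
            "type": "text",
            "text": system_prompt,              # must be >= 1,024 tokens to cache
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": [{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}",
        }],
    }
```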
6. Batch non-urgent queries
OpenAI's Batch API offers 50% off for async processing. If your RAG queries don't need real-time responses (nightly report generation, batch document analysis, scheduled summaries), batch them and cut your generation bill in half.
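Batch jobs are uploaded as a JSONL file, one request per line. A sketch of building those lines in the shape OpenAI's Batch API expects (the model ID and system prompt are assumptions):

```python
import json

def batch_line(custom_id: str, context: str, question: str) -> str:
    """One JSONL line for OpenAI's Batch API (/v1/chat/completions)."""
    body = {
        "model": "gpt-5-mini",  # assumed model ID
        "messages": [
            {"role": "system",
             "content": "Answer using only the provided context."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    }
    return json.dumps({
        "custom_id": custom_id,   # your key for matching results back
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": body,
    })
```

You then upload the file, create the batch job, and collect results within the completion window.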
7. Consider open-source models
Self-hosting Llama 4 Maverick or Mistral Small on your own GPU eliminates per-token costs entirely. Break-even vs API typically happens around 500K–1M queries/month depending on hardware costs. Our local vs cloud AI comparison has the full break-even analysis.
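Break-even is just fixed hardware cost divided by your per-query API cost. A sketch with illustrative numbers (the $2,000/month GPU server and $0.004/query rate are assumptions, and the math ignores ops time, utilization, and model quality differences):

```python
def breakeven_queries(gpu_monthly_cost: float, api_cost_per_query: float) -> float:
    """Monthly query volume at which self-hosting matches API spend."""
    return gpu_monthly_cost / api_cost_per_query

# A $2,000/month GPU server vs. an API costing $0.004/query:
# break-even around 500,000 queries/month
print(breakeven_queries(2_000, 0.004))
```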
Real-world RAG cost example
Scenario: A legal tech startup building a contract analysis tool.
- 50,000 queries/month
- Average 5 chunks retrieved (2,500 context tokens)
- System prompt: 300 tokens
- User query: 100 tokens
- Total input: 2,900 tokens/query
- Output: 600 tokens/query
| Approach | Model | Monthly Cost |
|---|---|---|
| Premium | Claude Sonnet 4.5 | $880 |
| Balanced | GPT-5 mini | $48 |
| Budget | Gemini 2.5 Flash | $15 |
| Optimized | Flash + caching + routing | ~$8 |
The optimized approach uses Gemini Flash for 80% of queries, GPT-5 mini for complex ones, plus prompt caching. Result: 99% cheaper than the premium approach with ~95% of the quality for retrieval-based tasks.
✅ TL;DR: The difference between a naive RAG deployment and an optimized one is 100× in cost. Smart model routing + prompt caching + chunk optimization gets you there without sacrificing answer quality.
Embedding + generation: total RAG cost
For a complete picture, here's the all-in monthly cost for a 50K query/month RAG app with 100K documents:
| Component | Budget Approach | Premium Approach |
|---|---|---|
| Embedding (one-time, amortized) | ~$1 | ~$1 |
| Vector DB (Qdrant Cloud) | ~$30 | ~$30 |
| Generation | ~$15 (Gemini Flash) | ~$440 (GPT-5) |
| Total | ~$46/month | ~$471/month |
Same RAG app, same documents, same queries — 10× cost difference based purely on which generation model you choose. The embedding and infrastructure layers are identical. For more ways to reduce your API spend, read our guide on how to reduce AI API costs.
Frequently asked questions
How much does a RAG application cost per month?
A RAG app serving 10,000 queries/month costs between $3 and $165 depending on which LLM you use for generation. Budget models like Gemini 2.5 Flash ($2.75/mo) and GPT-5 nano ($3.50/mo) handle most retrieval Q&A well. Add $20–70 for vector database hosting. Embedding costs are negligible — under $1 for most knowledge bases. Use our calculator to model your exact workload.
Are embedding costs significant for RAG?
No. Embedding is the cheapest part of a RAG pipeline. Even embedding 1 million documents costs only $40–260 depending on the model. OpenAI's text-embedding-3-small at $0.02/M tokens is the sweet spot: cheap, fast, and accurate enough for most retrieval tasks. Google's embedding model is free within generous limits. Learn more about token pricing fundamentals.
What's the cheapest way to run RAG at scale?
Use Gemini 2.5 Flash or GPT-5 nano for generation ($275–350/month at 1M queries), self-host pgvector for your database ($20/month VPS), and implement prompt caching to cut generation costs by 50%. A well-optimized RAG pipeline at 1M queries/month can run for under $200/month total. Check our cheapest AI APIs ranking for the latest budget options.
Should I use a flagship model like GPT-5 or Claude for RAG?
Usually not. RAG queries are retrieval-grounded — the model mostly needs to synthesize information that's already in the context, not reason from scratch. Budget models handle this well. Reserve flagship models for complex multi-step reasoning, ambiguous queries, or regulated domains where accuracy is critical. For a comparison of premium models, see our GPT-5 vs Claude Opus breakdown.
How do I calculate the cost of my specific RAG setup?
Use the formula: Monthly Cost = Queries × [(Input Tokens × Input Price) + (Output Tokens × Output Price)] / 1,000,000. Count your average input tokens by adding system prompt + retrieved chunks + user query. Set your output cap with max_tokens. Then multiply by your monthly query volume. Or skip the math and plug your numbers into our AI Cost Calculator.
All pricing data from our database, updated February 2026. Run your own RAG cost projection with the AI Cost Calculator.
