Retrieval-Augmented Generation (RAG) is the most popular way to build AI applications that work with your own data. But unlike a simple chatbot, RAG has multiple cost components — embeddings, storage, retrieval, and generation — that add up fast at scale. Most teams budget only for the LLM call and get blindsided by the rest of the bill.
Here's exactly what RAG costs in 2026 across every major provider, broken down by every layer of the pipeline, with real numbers you can use to budget your next project.
The three cost layers of RAG
Every RAG pipeline has three billable stages:
- Embedding — Converting your documents into vectors (one-time + updates)
- Retrieval — Searching your vector database (infrastructure cost, not API)
- Generation — Sending retrieved context + query to an LLM for answers
Most cost discussions focus only on generation. That's a mistake — you need the full picture to budget accurately. Let's break down all three.
💡 Key Takeaway: Generation typically accounts for 80–95% of total RAG costs. But ignoring embedding and infrastructure costs can throw off your budget by 10–20% at scale.
Layer 1: Embedding costs
You pay embedding costs when you ingest documents. This is mostly a one-time cost, plus incremental updates as your knowledge base grows.
| Model | Price per 1M tokens | Dimensions |
|---|---|---|
| OpenAI text-embedding-3-small | $0.02 | 1,536 |
| OpenAI text-embedding-3-large | $0.13 | 3,072 |
| Google text-embedding-005 | $0.00 (free tier) | 768 |
| Mistral embed | $0.10 | 1,024 |
| Cohere embed-v4 | $0.10 | 1,024 |
Real-world example: Embedding 10,000 documents (averaging 2,000 tokens each) = 20M tokens.
- OpenAI small: $0.40 (one-time)
- OpenAI large: $2.60 (one-time)
- Google: Free (within limits)
Embedding costs are almost negligible. Even at 1 million documents, you're looking at $40–260 with OpenAI. This is not where your budget goes.
📊 Quick Math: 1 million documents × 2,000 tokens average = 2 billion tokens. At OpenAI's small embedding rate ($0.02/M), that's $40 total to embed your entire knowledge base.
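That quick math generalizes to a few lines. A minimal sketch in Python, with per-1M-token rates hardcoded from the table above (verify them against current pricing before budgeting):

```python
# One-time embedding cost estimator. Prices are $ per 1M tokens,
# taken from the embedding table above; refresh before relying on them.
EMBED_PRICE_PER_M = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "mistral-embed": 0.10,
}

def embedding_cost(num_docs: int, avg_tokens_per_doc: int, model: str) -> float:
    """Total dollars to embed a corpus once."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * EMBED_PRICE_PER_M[model]

# 1M documents at 2,000 tokens each with the small model: $40 one-time
print(embedding_cost(1_000_000, 2_000, "text-embedding-3-small"))
```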
Layer 2: Vector database costs
This is infrastructure, not API — but it's a real RAG cost people forget:
| Solution | Monthly Cost (1M vectors) |
|---|---|
| Pinecone Starter | Free (up to 100K) |
| Pinecone Standard | ~$70/month |
| Qdrant Cloud | ~$30/month |
| Weaviate Cloud | ~$25/month |
| pgvector (self-hosted) | Your server cost |
| Chroma (self-hosted) | Free (your server) |
For most startups, a self-hosted pgvector on a $20/month VPS handles millions of vectors easily. Don't overpay for managed vector databases until you actually need the scale features — most RAG apps under 5M vectors run fine on a single Postgres instance.
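To see why a plain Postgres instance goes so far: vector retrieval is, at its core, a similarity scan plus an index. A pure-Python sketch of brute-force cosine top-k (pgvector expresses the same idea in SQL, with operators like `<=>` for cosine distance and optional ANN indexes to skip the full scan):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Indices of the k most similar stored vectors (brute-force scan)."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine_sim(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

An index exists precisely to avoid this full scan at scale, but the data model is simple enough that a single database handles it comfortably.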
Layer 3: Generation costs (where the money goes)
This is 80–95% of your RAG spend. Every query sends retrieved chunks + the user's question to an LLM. The key variable: how much context you stuff in.
A typical RAG query looks like:
- System prompt: ~200 tokens
- Retrieved chunks: 3–8 chunks × 500 tokens = 1,500–4,000 tokens
- User query: ~50 tokens
- Total input: ~2,000–4,500 tokens per query
- Output: ~300–800 tokens
Let's calculate monthly costs for 10,000 RAG queries/month with average 3,000 input + 500 output tokens per query:
| Model | Input Cost | Output Cost | Total/Month |
|---|---|---|---|
| Gemini 2.5 Flash | $0.75 | $2.00 | $2.75 |
| GPT-5 nano | $1.50 | $2.00 | $3.50 |
| Mistral Small 3.2 | $3.00 | $1.50 | $4.50 |
| GPT-5 mini | $3.00 | $6.00 | $9.00 |
| DeepSeek V3.2 | $8.40 | $2.10 | $10.50 |
| Mistral Large 3 | $15.00 | $7.50 | $22.50 |
| GPT-5 | $37.50 | $50.00 | $87.50 |
| Claude Sonnet 4.5 | $90.00 | $75.00 | $165.00 |
The winner for most RAG use cases: Gemini 2.5 Flash at $2.75/month for 10K queries. GPT-5 nano is close at $3.50. Both deliver surprisingly good retrieval-based Q&A despite being budget models.
Scaling up: 100K and 1M queries/month
Here's where model choice becomes your single most important financial decision:
100,000 queries/month
| Model | Monthly Cost |
|---|---|
| Gemini 2.5 Flash | $27.50 |
| GPT-5 nano | $35.00 |
| Mistral Small 3.2 | $45.00 |
| GPT-5 mini | $90.00 |
| GPT-5 | $875.00 |
| Claude Sonnet 4.5 | $1,650.00 |
1,000,000 queries/month
| Model | Monthly Cost |
|---|---|
| Gemini 2.5 Flash | $275 |
| GPT-5 nano | $350 |
| Mistral Small 3.2 | $450 |
| GPT-5 | $8,750 |
| Claude Sonnet 4.5 | $16,500 |
📊 Stat: $16,225/month is the cost difference between Gemini 2.5 Flash and Claude Sonnet 4.5 at 1M RAG queries, nearly $195K per year.
Model choice is your biggest cost lever. Everything else — chunk optimization, caching, reranking — is optimization at the margins compared to picking the right generation model.
The RAG cost formula
Here's the formula to estimate your monthly RAG generation cost:
Monthly Cost = Queries × [(Input Tokens × Input Price) + (Output Tokens × Output Price)] / 1,000,000, where prices are in dollars per 1M tokens and token counts are per-query averages.
Variables you control:
- Number of retrieved chunks — fewer chunks = less input cost (but potentially worse answers)
- Chunk size — smaller chunks reduce token count but may lose context
- Output length — constrain with max_tokens if you need concise answers
- Model choice — the single biggest lever
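As code, the formula and its levers look like this. A sketch where the per-1M-token rates are the ones implied by the Gemini 2.5 Flash row in the table above (treat them as illustrative inputs, not quoted prices):

```python
def rag_monthly_cost(queries: int, input_tokens: int, output_tokens: int,
                     input_price: float, output_price: float) -> float:
    """Monthly generation cost. Prices are $ per 1M tokens;
    token counts are per-query averages."""
    per_query_cost = (input_tokens * input_price
                      + output_tokens * output_price) / 1_000_000
    return queries * per_query_cost

# 10K queries/month, 3,000 input + 500 output tokens per query,
# at $0.025/M input and $0.40/M output: roughly $2.75/month
print(rag_monthly_cost(10_000, 3_000, 500, 0.025, 0.40))
```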
Use our AI Cost Calculator to run these numbers for your specific workload without doing the math by hand.
7 ways to reduce RAG costs
1. Use a cheaper model for simple queries
Route easy questions (FAQs, lookups) to GPT-5 nano or Gemini 2.5 Flash. Save flagship models for complex reasoning queries. A simple complexity router can cut costs 60–80%. We cover this in more detail in our 10 strategies to cut your AI API bill.
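A complexity router doesn't need to be clever to pay for itself. A hypothetical sketch (the model IDs, word-count threshold, and hint list are all placeholder assumptions to tune on your own traffic):

```python
CHEAP_MODEL = "gemini-2.5-flash"   # assumed model IDs; substitute your own
PREMIUM_MODEL = "gpt-5"

# Signals that a query likely needs multi-step reasoning.
COMPLEX_HINTS = ("compare", "why", "analyze", "explain the difference", "trade-off")

def route(query: str) -> str:
    """Send long or reasoning-flavored queries to the premium model."""
    q = query.lower()
    if len(q.split()) > 30 or any(hint in q for hint in COMPLEX_HINTS):
        return PREMIUM_MODEL
    return CHEAP_MODEL
```

In production you'd log routing decisions and spot-check the cheap model's answers before trusting the split.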
2. Optimize chunk count
Most RAG pipelines retrieve 5–10 chunks by default. Test with 3–4 chunks — often the top results contain 90% of the relevant information. Going from 8 to 3 chunks saves ~50% on input tokens.
3. Compress retrieved context
Use a reranker (like Cohere Rerank) to filter chunks before sending to the LLM. Spend $0.001 on reranking to save $0.01 on generation tokens. That's a 10× ROI on every query.
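That 10× figure is straightforward arithmetic. A sketch with illustrative numbers (the per-query rerank cost, tokens dropped, and input token rate are assumptions):

```python
def rerank_roi(rerank_cost: float, tokens_saved: int, input_price_per_m: float) -> float:
    """Generation dollars saved per dollar spent on reranking."""
    generation_saved = tokens_saved / 1_000_000 * input_price_per_m
    return generation_saved / rerank_cost

# Dropping 5 chunks x 500 tokens against a $4/M-token input rate,
# for $0.001 of reranking: ~10x return per query
print(rerank_roi(0.001, 2_500, 4.0))
```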
4. Cache common queries
If 20% of queries are repeated (common in support bots), cache the responses. A simple semantic cache can reduce API calls by 15–30%. This is especially effective for internal knowledge bases where employees ask similar questions.
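A minimal semantic cache embeds the incoming query and reuses a stored answer when a past query is similar enough. A sketch over raw embedding vectors (the similarity threshold is an assumption to tune; a real deployment would back this with a vector index rather than a linear scan):

```python
import math

class SemanticCache:
    """Linear-scan semantic cache; swap in a vector index at scale."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    @staticmethod
    def _sim(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a))
                      * math.sqrt(sum(x * x for x in b)))

    def get(self, query_emb):
        """Return a cached answer if a similar query was seen, else None."""
        for emb, answer in self.entries:
            if self._sim(query_emb, emb) >= self.threshold:
                return answer  # cache hit: no LLM call needed
        return None

    def put(self, query_emb, answer):
        self.entries.append((query_emb, answer))
```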
5. Use prompt caching
Anthropic and OpenAI both offer prompt caching that reduces input costs by 50–90% for repeated prefixes. If your system prompt + instructions are constant across requests (they usually are in RAG), you only pay full price once. This alone can halve your generation costs.
⚠️ Warning: Prompt caching has minimum token thresholds. Anthropic requires at least 1,024 tokens in the cached prefix, and OpenAI's automatic caching likewise only kicks in at 1,024 tokens (cached in 128-token increments). If your system prompt is shorter, pad it with instructions or examples to hit the threshold; the savings are worth it.
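On Anthropic's Messages API, caching is opt-in: you mark the static prefix with a `cache_control` block (OpenAI's caching is automatic once the prefix is long enough). A sketch of the request body; the model ID is an assumption:

```python
# Static RAG instructions go in the system block marked for caching;
# only the retrieved chunks and user question vary per request.
def build_request(system_prompt: str, context: str, question: str) -> dict:
    return {
        "model": "claude-sonnet-4-5",           # assumed model ID
        "max_tokens": 800,
        "system": [{
            "type": "text",
            "text": system_prompt,              # must be >= 1,024 tokens to cache
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": [{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}",
        }],
    }
```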
6. Batch non-urgent queries
OpenAI's Batch API offers 50% off for async processing. If your RAG queries don't need real-time responses (nightly report generation, batch document analysis, scheduled summaries), batch them and cut your generation bill in half.
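Batch jobs are uploaded as a JSONL file, one request per line. A sketch of building those lines in the shape OpenAI's Batch API expects (the model ID and system prompt are assumptions):

```python
import json

def batch_line(custom_id: str, context: str, question: str) -> str:
    """One JSONL line for OpenAI's Batch API (/v1/chat/completions)."""
    body = {
        "model": "gpt-5-mini",  # assumed model ID
        "messages": [
            {"role": "system",
             "content": "Answer using only the provided context."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    }
    return json.dumps({
        "custom_id": custom_id,   # your key for matching results back
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": body,
    })
```

You then upload the file, create the batch job, and collect results within the completion window.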
7. Consider open-source models
Self-hosting Llama 4 Maverick or Mistral Small on your own GPU eliminates per-token costs entirely. Break-even vs API typically happens around 500K–1M queries/month depending on hardware costs. Our local vs cloud AI comparison has the full break-even analysis.
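Break-even is just fixed hardware cost divided by your per-query API cost. A sketch with illustrative numbers (the $2,000/month GPU server and $0.004/query rate are assumptions, and the math ignores ops time, utilization, and model quality differences):

```python
def breakeven_queries(gpu_monthly_cost: float, api_cost_per_query: float) -> float:
    """Monthly query volume at which self-hosting matches API spend."""
    return gpu_monthly_cost / api_cost_per_query

# A $2,000/month GPU server vs. an API costing $0.004/query:
# break-even around 500,000 queries/month
print(breakeven_queries(2_000, 0.004))
```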
Real-world RAG cost example
Scenario: A legal tech startup building a contract analysis tool.
- 50,000 queries/month
- Average 5 chunks retrieved (2,500 context tokens)
- System prompt: 300 tokens
- User query: 100 tokens
- Total input: 2,900 tokens/query
- Output: 600 tokens/query
| Approach | Model | Monthly Cost |
|---|---|---|
| Premium | Claude Sonnet 4.5 | $880 |
| Balanced | GPT-5 mini | $48 |
| Budget | Gemini 2.5 Flash | $15 |
| Optimized | Flash + caching + routing | ~$8 |
The optimized approach uses Gemini Flash for 80% of queries, GPT-5 mini for complex ones, plus prompt caching. Result: 99% cheaper than the premium approach with ~95% of the quality for retrieval-based tasks.
✅ TL;DR: The difference between a naive RAG deployment and an optimized one is 100× in cost. Smart model routing + prompt caching + chunk optimization gets you there without sacrificing answer quality.
Embedding + generation: total RAG cost
For a complete picture, here's the all-in monthly cost for a 50K query/month RAG app with 100K documents:
| Component | Budget Approach | Premium Approach |
|---|---|---|
| Embedding (one-time, amortized) | ~$1 | ~$1 |
| Vector DB (Qdrant Cloud) | ~$30 | ~$30 |
| Generation | ~$15 (Gemini Flash) | ~$440 (GPT-5) |
| Total | ~$46/month | ~$471/month |
Same RAG app, same documents, same queries — 10× cost difference based purely on which generation model you choose. The embedding and infrastructure layers are identical. For more ways to reduce your API spend, read our guide on how to reduce AI API costs.
Frequently asked questions
How much does a RAG application cost per month?
A RAG app serving 10,000 queries/month costs between $3 and $165 depending on which LLM you use for generation. Budget models like Gemini 2.5 Flash ($2.75/mo) and GPT-5 nano ($3.50/mo) handle most retrieval Q&A well. Add $20–70 for vector database hosting. Embedding costs are negligible — under $1 for most knowledge bases. Use our calculator to model your exact workload.
Are embedding costs significant for RAG?
No. Embedding is the cheapest part of a RAG pipeline. Even embedding 1 million documents costs only $40–260 depending on the model. OpenAI's text-embedding-3-small at $0.02/M tokens is the sweet spot: cheap, fast, and accurate enough for most retrieval tasks. Google's embedding model is free within generous limits. Learn more about token pricing fundamentals.
What's the cheapest way to run RAG at scale?
Use Gemini 2.5 Flash or GPT-5 nano for generation ($275–350/month at 1M queries), self-host pgvector for your database ($20/month VPS), and implement prompt caching to cut generation costs by 50%. A well-optimized RAG pipeline at 1M queries/month can run for under $200/month total. Check our cheapest AI APIs ranking for the latest budget options.
Should I use a flagship model like GPT-5 or Claude for RAG?
Usually not. RAG queries are retrieval-grounded — the model mostly needs to synthesize information that's already in the context, not reason from scratch. Budget models handle this well. Reserve flagship models for complex multi-step reasoning, ambiguous queries, or regulated domains where accuracy is critical. For a comparison of premium models, see our GPT-5 vs Claude Opus breakdown.
How do I calculate the cost of my specific RAG setup?
Use the formula: Monthly Cost = Queries × [(Input Tokens × Input Price) + (Output Tokens × Output Price)] / 1,000,000. Count your average input tokens by adding system prompt + retrieved chunks + user query. Set your output cap with max_tokens. Then multiply by your monthly query volume. Or skip the math and plug your numbers into our AI Cost Calculator.
All pricing data from our database, updated February 2026. Run your own RAG cost projection with the AI Cost Calculator.
