Embeddings are the cheapest important model call in a modern AI stack, and they are also the easiest one to budget badly. Teams obsess over generation pricing, then quietly spend weeks re-embedding the same corpus, over-indexing low-value content, and stuffing vector databases with text that never improves retrieval.
In 2026, the core economic question is simple: should you pay once to turn content into vectors, or keep paying premium generation models to reread raw context on every request? For most retrieval-heavy products, embeddings win by a wide margin. The cheaper path is to pre-compute semantic representations, retrieve a small set of relevant chunks, and send only the useful context into a generation model.
This guide breaks down what that actually costs using real pricing from the AI Cost Check dataset. The key reference point is Gemini Embedding 2, currently listed at $0.20 per 1 million input tokens. I will compare that embedding cost against common generation models like GPT-5 mini, GPT-4o mini, and Gemini 2.0 Flash, then show where embeddings save money, where they do not, and how to size your budget without fooling yourself.
💡 Key Takeaway: If your app repeatedly searches the same documents, embeddings are not an optimization. They are the default architecture. Re-reading raw context with a generation model is the expensive detour.
The baseline price you should care about
The cleanest pricing anchor in the current dataset is Gemini Embedding 2:
| Model | Type | Price per 1M input tokens | Price per 1M output tokens | Context window |
|---|---|---|---|---|
| Gemini Embedding 2 | Embedding | $0.20 | n/a | 8,192 |
| Gemini 2.0 Flash | Generation | $0.10 | $0.40 | 1,000,000 |
| GPT-4o mini | Generation | $0.15 | $0.60 | 128,000 |
| GPT-5 mini | Generation | $0.25 | $2.00 | 500,000 |
| Gemini 2.5 Flash | Generation | $0.30 | $2.50 | 1,000,000 |
| Claude Haiku 4.5 | Generation | $1.00 | $5.00 | 200,000 |
At first glance, embeddings do not look dramatically cheaper than the lowest-cost generation models on input price alone. Gemini 2.0 Flash is actually $0.10 per million input tokens, lower than Gemini Embedding 2.
That is where shallow pricing analysis goes off the rails.
Embedding economics are not about raw input price alone. They are about one-time indexing versus repeated reading. Generation models charge you every time you send the same chunk again. An embedding model charges you to index the content once, then you query the vector store with tiny lookups and only pass the best matches into the expensive model.
📊 Stat: 10x+ — Re-reading the same corpus at generation time can cost ten times or more what embedding it once costs, even when the generation model has a lower input token price.
Why input-only comparisons mislead teams
If you shove a 5,000-token policy document into a generation model every time a user asks a question, you are paying that input price on every request. If you embed the document once, you pay to index the 5,000 tokens once, then only send the top few chunks back to the generation model later.
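That break-even intuition is easy to sketch in a few lines of Python. Prices come from the table above; the function names, document size, and request count are illustrative assumptions, not a prescribed implementation:

```python
# Sketch: break-even between embedding a document once and re-sending it
# raw to a generation model on every request. Prices are $/1M input tokens
# from the article's table; workload numbers are hypothetical.
EMBED_PRICE = 0.20      # Gemini Embedding 2
GEN_INPUT_PRICE = 0.15  # GPT-4o mini

def cost_embed_once(doc_tokens: int, chunk_tokens_per_query: int,
                    requests: int, gen_price: float = GEN_INPUT_PRICE) -> float:
    """Index the document once, then send only retrieved chunks."""
    index = doc_tokens / 1e6 * EMBED_PRICE
    reads = requests * chunk_tokens_per_query / 1e6 * gen_price
    return index + reads

def cost_reread(doc_tokens: int, requests: int,
                gen_price: float = GEN_INPUT_PRICE) -> float:
    """Send the full raw document to the generator on every request."""
    return requests * doc_tokens / 1e6 * gen_price

# 5,000-token policy doc, 1,000 questions, 500 retrieved tokens per answer
print(round(cost_embed_once(5_000, 500, 1_000), 3))  # 0.076
print(round(cost_reread(5_000, 1_000), 2))           # 0.75
```

Even on this tiny document, the re-read path costs roughly ten times more, and the gap widens with every additional request.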
That architecture shift matters more than the sticker price.
⚠️ Warning: The most common RAG budgeting mistake is using generation pricing to estimate embedding-heavy workloads, then forgetting that bad chunking and repeated re-indexing can quietly double your actual bill.
When embeddings are dramatically cheaper than long-context prompting
Embedding systems save money when three conditions are true:
- You search the same content repeatedly.
- Most of the corpus is irrelevant to any one query.
- You can answer with a few retrieved chunks instead of the whole document set.
That is almost every serious knowledge base, support bot, research assistant, policy search tool, and internal document copilot.
Here is a simple comparison. Assume you have 10 million tokens of documents to make searchable.
Option A: embed the corpus once
Using Gemini Embedding 2:
- 10,000,000 tokens / 1,000,000 = 10 units
- 10 × $0.20 = $2.00 to embed the full corpus
That is not a typo. Indexing ten million tokens costs two dollars at the listed rate.
Option B: skip embeddings and stuff large context on demand
Imagine a cheap generation path using GPT-4o mini at $0.15 per million input tokens. If each user request re-sends even 50,000 tokens of raw context, then 1,000 requests cost:
- 50,000,000 total input tokens
- 50 × $0.15 = $7.50 input cost
- Plus output cost for every answer
Now use GPT-5 mini at $0.25 per million input tokens for the same 1,000 requests:
- 50 × $0.25 = $12.50 input cost
- Plus output at $2.00 per million output tokens
That is already 6.25x the one-time embedding cost on input alone, and the more requests you serve, the uglier it gets.
📊 Quick Math: Embedding a 10M-token corpus with Gemini Embedding 2 costs $2.00. Sending that same 10M tokens through GPT-5 mini just ten times costs $25.00 in input alone.
The point is not that you will literally send the whole corpus each time. The point is that long-context prompting becomes expensive fast when retrieval could have narrowed the payload.
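As a sanity check, the quick math above reproduces directly in code. Prices are from the table; the ten-read count is the hypothetical from the callout:

```python
# Sketch of the quick math: embedding a 10M-token corpus once with
# Gemini Embedding 2 versus sending the full corpus through GPT-5 mini
# ($0.25/1M input tokens) ten times. Input cost only.
CORPUS_TOKENS = 10_000_000

embed_once = CORPUS_TOKENS / 1e6 * 0.20           # one-time indexing
ten_full_reads = 10 * CORPUS_TOKENS / 1e6 * 0.25  # ten full re-reads

print(embed_once)      # 2.0
print(ten_full_reads)  # 25.0
```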
Real-world scenario 1: a startup help center
Suppose you run a SaaS company with a help center, product docs, and onboarding material totaling 3 million tokens. You want a support assistant that answers 8,000 questions per month.
Embedding cost
Indexing 3 million tokens with Gemini Embedding 2:
- 3 × $0.20 = $0.60
Even if you fully re-embed the entire corpus once every month, your monthly embedding bill is still roughly sixty cents.
Retrieval plus answer generation
Assume each query retrieves 2,000 tokens of useful context and the answer is 400 output tokens. Using GPT-4o mini:
- Input: 8,000 × 2,000 = 16,000,000 tokens
- Input cost: 16 × $0.15 = $2.40
- Output: 8,000 × 400 = 3,200,000 tokens
- Output cost: 3.2 × $0.60 = $1.92
- Total generation cost: $4.32
- Add embeddings: $0.60
- Monthly total: $4.92
If you skipped embeddings and sent 20,000 tokens of raw docs per query instead:
- Input: 160,000,000 tokens
- Input cost: 160 × $0.15 = $24.00
- Output cost stays around $1.92
- Monthly total: $25.92
That is more than 5x the cost for a worse architecture.
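The help-center numbers above fit a small reusable cost function. This is a sketch with the scenario's assumed token counts; the function name and signature are illustrative:

```python
# Sketch reproducing the help-center scenario: 8,000 questions/month on
# GPT-4o mini ($0.15 in / $0.60 out per 1M tokens), with a $0.60
# monthly re-embedding of the 3M-token corpus.
def monthly_cost(queries, ctx_tokens, out_tokens,
                 in_price, out_price, embed_cost=0.0):
    gen_in = queries * ctx_tokens / 1e6 * in_price
    gen_out = queries * out_tokens / 1e6 * out_price
    return round(gen_in + gen_out + embed_cost, 2)

with_rag = monthly_cost(8_000, 2_000, 400, 0.15, 0.60, embed_cost=0.60)
stuffing = monthly_cost(8_000, 20_000, 400, 0.15, 0.60)
print(with_rag)  # 4.92
print(stuffing)  # 25.92
```

Swapping in your own query volume and context sizes is usually enough to see whether retrieval pays for itself.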
Recommendation
For support search, embeddings are mandatory. The content changes slowly, the same corpus is queried constantly, and the user only needs the most relevant pieces.
✅ TL;DR: If your product has stable documentation and repeat traffic, pay the small indexing cost and stop making generation models reread your whole library.
Real-world scenario 2: an internal knowledge bot for sales and ops
Now take a larger company corpus: 50 million tokens across transcripts, SOPs, proposals, and internal docs. Employees ask 20,000 questions per month.
Monthly indexing strategy
You probably do not re-embed the full 50 million tokens every day. A more realistic approach is:
- Initial full indexing: 50 × $0.20 = $10.00
- Monthly changes: assume 5 million new or updated tokens
- Incremental monthly re-embedding: 5 × $0.20 = $1.00
That means the ongoing monthly embedding cost can be around one dollar after the initial build.
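The build-versus-refresh split above is simple arithmetic, sketched here with the scenario's assumed change rate:

```python
# Sketch: one-time full index versus ongoing incremental re-embedding
# at $0.20 per 1M tokens (Gemini Embedding 2). The 5M-token monthly
# churn (10% of the corpus) is the scenario's assumption.
PRICE = 0.20

def index_cost(tokens: int) -> float:
    return tokens / 1e6 * PRICE

initial = index_cost(50_000_000)  # one-time build
monthly = index_cost(5_000_000)   # changed/new content only
print(initial, monthly)  # 10.0 1.0
```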
Retrieval and answer cost with Gemini 2.0 Flash
Assume each query retrieves 3,000 input tokens and generates 500 output tokens using Gemini 2.0 Flash:
- Input: 20,000 × 3,000 = 60,000,000 tokens
- Input cost: 60 × $0.10 = $6.00
- Output: 20,000 × 500 = 10,000,000 tokens
- Output cost: 10 × $0.40 = $4.00
- Generation total: $10.00
- Add monthly embedding refresh: $1.00
- Total monthly run cost: $11.00
If the same system used crude prompt stuffing at 30,000 input tokens per request:
- Input: 600,000,000 tokens
- Input cost: 600 × $0.10 = $60.00
- Output cost: $4.00
- Total monthly run cost: $64.00
That is still affordable, but it is wasteful. More important, bigger prompts usually mean slower answers and more irrelevant noise.
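Both run-rate estimates above check out with the same pattern as the earlier scenario, sketched here with the assumed token counts:

```python
# Sketch of the internal-knowledge-bot numbers: 20,000 queries/month on
# Gemini 2.0 Flash ($0.10 in / $0.40 out per 1M tokens); token counts
# per request are the scenario's assumptions.
QUERIES = 20_000

def run_cost(ctx_tokens, out_tokens=500, embed_refresh=0.0):
    gen_in = QUERIES * ctx_tokens / 1e6 * 0.10
    gen_out = QUERIES * out_tokens / 1e6 * 0.40
    return gen_in + gen_out + embed_refresh

print(run_cost(3_000, embed_refresh=1.00))  # 11.0  (tight retrieval)
print(run_cost(30_000))                     # 64.0  (prompt stuffing)
```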
Recommendation
For internal knowledge assistants, the biggest savings come from good retrieval discipline, not from squeezing the cheapest generation model. Embed once, update incrementally, and keep retrieved context tight.
Real-world scenario 3: content recommendation and semantic search
Embeddings are not just for chatbot retrieval. They are often the cheapest way to power related-article widgets, semantic site search, clustering, deduplication, and recommendation engines.
Assume a content site has 2 million articles, snippets, and metadata tokens to index and receives 500,000 search or recommendation events per month.
Indexing cost
- 2 × $0.20 = $0.40 to embed the full textual metadata set
Query path
For recommendation systems, many lookups never need a generation model at all. The user enters a query, you embed the query text, search vectors, and return ranked items.
Assume the average query length is 20 tokens:
- 500,000 × 20 = 10,000,000 query tokens
- 10 × $0.20 = $2.00
Your monthly semantic retrieval cost is roughly $2.00, plus vector database costs.
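Because most lookups never touch a generation model, the query path costs almost nothing at the model layer. A sketch, using the scenario's assumed traffic and the listed embedding rate:

```python
# Sketch: semantic search where only the short query string is embedded
# ($0.20/1M tokens). Event count and average query length are the
# scenario's assumptions.
EMBED_PRICE = 0.20

def monthly_query_cost(events: int, avg_query_tokens: int = 20) -> float:
    return events * avg_query_tokens / 1e6 * EMBED_PRICE

print(monthly_query_cost(500_000))  # 2.0
```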
This is where vector DB economics matter. The embedding API is often cheap enough that the database, storage, replication, and latency tuning become the real bill. If you store too many low-value chunks, duplicate near-identical records, or keep dense metadata you never query, the database can cost more than the model.
💡 Key Takeaway: In mature RAG systems, the embedding bill is often the smallest line item. Storage bloat, over-chunking, and unnecessary re-index jobs are the costs that sneak up on you.
The vector database cost trap
Teams love to say embeddings are cheap, then build a vector store like they are free. That is how budgets get stupid.
Three patterns drive waste:
1. Over-chunking
If a clean 1,200-token article becomes twenty tiny chunks, you increase records, storage, and retrieval noise. You may also need more reranking or more downstream filtering. Cheaper embedding calls do not save you from a bloated index.
2. Full re-indexing instead of incremental updates
If only 2% of your corpus changes, re-embedding 100% of it is lazy engineering. At Gemini Embedding 2 prices, this might still look cheap at small scale, but the operational waste compounds with database rebuilds and pipeline time.
3. Embedding content that should never be retrieved
Boilerplate, navigation labels, duplicated templates, tiny fragments, and stale versions do not belong in the same retrieval layer as high-value source material.
A disciplined indexing pipeline does three things well: deduplicate aggressively, chunk around user intent, and re-embed only what changed.
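The "re-embed only what changed" part of that discipline is often implemented with content hashes. This is a minimal sketch, assuming a hypothetical `embed` callback rather than any specific provider's API:

```python
# Sketch: incremental indexing via content fingerprints. Each chunk is
# hashed; on the next run, chunks whose hash is unchanged are skipped,
# so only new or modified content hits the embedding API.
import hashlib

def fingerprint(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def incremental_index(chunks: dict, seen: dict, embed) -> list:
    """chunks: chunk_id -> text; seen: chunk_id -> last indexed hash."""
    updated = []
    for chunk_id, text in chunks.items():
        h = fingerprint(text)
        if seen.get(chunk_id) != h:  # new or modified content only
            embed(chunk_id, text)    # hypothetical embedding call
            seen[chunk_id] = h
            updated.append(chunk_id)
    return updated

# Example: only the changed chunk is re-embedded on the second run.
seen, calls = {}, []
embed = lambda cid, text: calls.append(cid)
incremental_index({"a": "refund policy v1", "b": "pricing"}, seen, embed)
incremental_index({"a": "refund policy v2", "b": "pricing"}, seen, embed)
print(calls)  # ['a', 'b', 'a']
```

In production the `seen` map would live alongside the vector store, so a full pipeline rerun stays cheap even when the source corpus is large.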
⚠️ Warning: Cheap model pricing does not rescue a bad corpus. If you embed noise, retrieval quality drops and your generation bill rises because you need bigger prompts to compensate.
Are embeddings always the right answer?
No. There are clear cases where embeddings are the wrong tool.
Skip embeddings when the corpus is tiny
If the relevant data is always a few hundred tokens and changes per request, retrieval infrastructure can be more work than it is worth. A direct prompt to Gemini 2.0 Flash or GPT-4o mini may be cleaner.
Skip embeddings when the task is primarily reasoning, not retrieval
If the hard part is multi-step planning, tool use, or code generation, embeddings only solve the lookup slice. They do not replace a reasoning model like GPT-5 or Claude Sonnet 4.6.
Skip embeddings when data freshness is second-by-second
For rapidly changing transactional data, you may be better off querying the source system directly and prompting on structured results.
The correct framing is this: embeddings are the right tool for semantic recall, not for every intelligence problem.
How to choose a cost strategy in 2026
Here is the blunt recommendation stack.
Best choice for most teams
Use Gemini Embedding 2 for indexing and a low-cost generator such as Gemini 2.0 Flash or GPT-4o mini for answers. This is the best default for support search, internal copilots, and lightweight RAG.
Best choice for higher-quality answer generation
Keep the embedding layer cheap, then spend money only on the final answer model. For example, embed with Gemini Embedding 2, retrieve a narrow context, then answer with GPT-5 mini or Claude Haiku 4.5 when tone or reliability matters.
Worst choice
Do not use a premium generation model as a search engine. Sending giant context windows into GPT-5 or Claude Sonnet 4.6 because retrieval feels annoying is how you turn a cheap product into a finance problem.
For a broader budgeting workflow, use the calculator and read the token guide plus How to Reduce AI API Costs. Those three pages together cover the basics, the token math, and the optimization layer.
What this means for your architecture
If you are building a RAG product in 2026, the smart stack is boring:
- Clean the corpus.
- Chunk it sanely.
- Embed once.
- Re-embed incrementally.
- Retrieve a small set of relevant chunks.
- Spend generation dollars only on the answer.
That pattern keeps costs low because it separates cheap semantic indexing from expensive reasoning. It also creates better latency and cleaner observability. You can measure corpus size, change rate, retrieval hit quality, and answer cost independently.
The nice part is that the first step is cheap. Using the current dataset, embedding millions of tokens is measured in cents to single-digit dollars, not hundreds. That means you can prototype retrieval correctly from day one instead of promising yourself you will optimize later.
✅ TL;DR: Embeddings are not where most teams overspend. They overspend by skipping embeddings, sending too much context to generators, and storing too much junk in the vector layer.
Frequently asked questions
What is an embedding model?
An embedding model converts text into vectors that preserve semantic similarity. In practice, that lets you search by meaning instead of exact keywords, cluster similar content, and retrieve the most relevant chunks before calling a generation model.
How much does it cost to embed documents in 2026?
Using Gemini Embedding 2, the current listed price is $0.20 per 1 million input tokens. That means 10 million tokens cost about $2.00 to index, which is why embedding is usually one of the cheapest parts of a RAG stack.
Are embeddings cheaper than long-context prompting?
Yes, for repeated retrieval workloads. Long-context prompting may look cheap per request, but it becomes expensive when you repeatedly send large amounts of the same source material. Embedding the corpus once and retrieving a few relevant chunks is usually much cheaper.
What is the biggest hidden cost in embedding systems?
The hidden cost is rarely the embedding API itself. It is usually vector database sprawl, over-chunking, full re-indexes when incremental updates would do, and low-quality source material that forces larger generation prompts later.
Should I use embeddings for every AI feature?
No. Use embeddings for semantic recall, search, recommendation, and RAG. Do not reach for them when the task is mostly reasoning, when the source data is tiny, or when the best answer comes from live structured queries instead of document retrieval.
Use the cheap layer first, then pay for answers
The cleanest 2026 play is simple: keep semantic retrieval cheap, keep your vector store lean, and reserve generation spend for the final response. That is the architecture that scales.
If you want to test the numbers against your own workload, run the AI Cost Check calculator, compare answer models like GPT-5 mini and GPT-4o mini, and read AI API Pricing Guide 2026 for the bigger market picture. The savings are usually hiding in retrieval design, not in heroic prompt tricks.
