March 3, 2026

How Prompt Caching Cuts Your AI API Bill by Up to 90%

Prompt caching slashes input token costs by 50-90% on OpenAI and Anthropic. Here's the math, the implementation, and when it's actually worth it.

Tags: cost-optimization, prompt-caching, openai, anthropic, finops, 2026

If you're running a chatbot with a 5,000-token system prompt and serving 10,000 conversations per day, you're paying for those 5,000 tokens 10,000 times. That's 50 million tokens per day in system prompt costs alone — before a single user message is processed.

Prompt caching makes those repeated tokens near-free. OpenAI and Anthropic both offer it. Anthropic's implementation cuts cached input costs by 90%. OpenAI's cuts them by 50%. For any application that reuses large context blocks — chatbots, RAG pipelines, AI agents with tool definitions — caching is the single highest-leverage cost optimization available.

This guide shows you exactly how much you'll save, how the math works, and how to implement it correctly.

What is prompt caching?

When you send a request to an LLM, the model processes every token in your input from scratch. Prompt caching changes this: the provider stores a processed representation of frequently reused context and skips the recomputation on subsequent requests.

The cache applies to tokens at the beginning of your prompt — typically your system prompt, tool definitions, few-shot examples, or retrieved documents that don't change between requests. The user's actual message (which does change) still gets processed normally.

From your perspective, the API call looks identical. The difference shows up on your bill.

Two factors determine your savings:

  1. What percentage of your input tokens are cached (cache hit ratio)
  2. How much the provider discounts cached tokens

A chatbot where the system prompt is 4,000 of every 5,000 input tokens has an 80% cache-eligible ratio. At a 90% discount, you're paying 10% of normal cost on those 80% of tokens, which works out to a 72% reduction in total input costs.
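That arithmetic generalizes to any cache ratio and discount rate. A throwaway helper (illustrative, not provider code) makes it easy to check, using the 90% and 50% discounts described below:

```python
def input_cost_reduction(cache_ratio: float, discount: float) -> float:
    """Fraction of total input cost saved when `cache_ratio` of the
    input tokens are served from cache at `discount` off normal price."""
    cached_share = cache_ratio * (1 - discount)  # cached tokens, discounted
    uncached_share = 1 - cache_ratio             # dynamic tokens, full price
    return 1 - (cached_share + uncached_share)

print(f"{input_cost_reduction(0.8, 0.9):.0%}")  # 90% discount -> 72%
print(f"{input_cost_reduction(0.8, 0.5):.0%}")  # 50% discount -> 40%
```

The savings cap out at cache_ratio × discount, which is why both factors matter.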

OpenAI prompt caching: 50% off, automatic

OpenAI applies prompt caching automatically with zero configuration on GPT-4o, GPT-4o mini, the GPT-4.1 series, and the o-series reasoning models. You don't write any special code — OpenAI handles the caching server-side and passes the discount to you automatically.

Cache threshold: The prefix must be at least 1,024 tokens to be cached. Tokens are cached in increments of 128 tokens after that.

Cache duration: Typically 5-10 minutes of inactivity before the cache expires. During off-peak periods, entries may persist up to an hour.

Discount rate: 50% off the normal input token price.

Here's what that means for the models you're most likely using:

| Model | Standard Input ($/1M) | Cached Input ($/1M) | Savings per 1M tokens |
| --- | --- | --- | --- |
| GPT-5.2 | $1.75 | $0.875 | $0.875 |
| GPT-5 mini | $0.25 | $0.125 | $0.125 |
| GPT-5 nano | $0.05 | $0.025 | $0.025 |
| GPT-4.1 | $2.00 | $1.00 | $1.00 |
| GPT-4.1 mini | $0.40 | $0.20 | $0.20 |
| GPT-4o | $2.50 | $1.25 | $1.25 |
| GPT-4o mini | $0.15 | $0.075 | $0.075 |

💡 Key Takeaway: OpenAI's caching is completely automatic. If you're using GPT-4o, GPT-4.1, or the o-series with prompts over 1,024 tokens, you're already getting cached pricing — check your usage dashboard to confirm.

Anthropic prompt caching: 90% off reads, but you pay for writes

Anthropic's prompt caching is more powerful but requires explicit implementation. You control exactly what gets cached using cache_control markers in your API requests.

How it works:

  1. Cache write: First time a prompt is processed, Anthropic stores the cache. You pay 25% more than normal input price for the write.
  2. Cache read: Subsequent requests that hit the cache cost 90% less than normal input price.
  3. Cache duration: 5 minutes by default. Extended to 1 hour with explicit configuration.

| Model | Standard Input | Cache Write | Cache Read | Read Savings |
| --- | --- | --- | --- | --- |
| Claude Opus 4.6 | $5.00/M | $6.25/M | $0.50/M | 90% |
| Claude Sonnet 4.6 | $3.00/M | $3.75/M | $0.30/M | 90% |
| Claude Haiku 4.5 | $1.00/M | $1.25/M | $0.10/M | 90% |
| Claude 3.5 Haiku | $0.80/M | $1.00/M | $0.08/M | 90% |

The cache write cost pays for itself almost immediately. The write premium is 25% of the normal input price, while each read saves 90% of it, so a single cache hit more than recovers the premium and every read after that is pure savings.

📊 Quick Math: Caching a 4,000-token Claude Sonnet system prompt:

  • Write cost: 4,000 × $3.75/M = $0.015
  • Normal cost (no cache): 4,000 × $3.00/M = $0.012 per call
  • Cache read cost: 4,000 × $0.30/M = $0.0012 per call
  • Break-even: the first cache hit. The $0.003 write premium is recovered by the first read, and every read saves $0.0108.
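The same Quick Math as running code, comparing cumulative cost with and without caching (prices from the Sonnet row above; the first cached call is the write):

```python
PRICE_PER_TOKEN = 3.00 / 1_000_000   # Claude Sonnet standard input
TOKENS = 4_000

normal = TOKENS * PRICE_PER_TOKEN    # $0.012 per uncached call
write = normal * 1.25                # $0.015, 25% write premium
read = normal * 0.10                 # $0.0012, 90% read discount

def cumulative_cost(calls: int, cached: bool) -> float:
    """Total input cost after `calls` requests (one write, then reads)."""
    if not cached:
        return calls * normal
    return write + (calls - 1) * read

# Caching pulls ahead from the second call onward.
for n in (1, 2, 10):
    print(n, round(cumulative_cost(n, False), 4), round(cumulative_cost(n, True), 4))
```

At ten calls you've paid $0.0258 cached versus $0.12 uncached.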

Real cost savings: three scenarios

Let's run the numbers on common applications.

Scenario 1: Customer support chatbot

Setup: System prompt with brand guidelines, product catalog, and support policies — 6,000 tokens. Average user message: 150 tokens. Average response: 400 tokens. Volume: 5,000 conversations/day.

Without caching, input cost per conversation = 6,150 tokens. With caching, the 6,000-token system prompt is cached; you only pay full price on the 150-token user message.

| Model | No Cache (monthly) | With Cache (monthly) | Monthly Savings |
| --- | --- | --- | --- |
| Claude Opus 4.6 | $4,612 | $481 | $4,131 (89%) |
| Claude Sonnet 4.6 | $2,767 | $289 | $2,478 (89%) |
| GPT-4.1 | $3,690 | $1,868 | $1,822 (49%) |
| GPT-4o mini | $221 | $112 | $109 (49%) |

📊 $4,131/month saved by caching a 6,000-token Claude Opus system prompt across 5,000 daily conversations.

Scenario 2: AI agent with tool definitions

Setup: System prompt + tool definitions = 8,000 tokens. 10 API calls per agent task. User context grows per call (average 2,000 tokens accumulated). Average output: 800 tokens. Volume: 500 tasks/day.

The system prompt and tool definitions are constant across all calls within a task — prime caching candidates.

| Model | No Cache (monthly) | With Cache (monthly) | Monthly Savings |
| --- | --- | --- | --- |
| Claude Sonnet 4.6 | $13,500 | $1,755 | $11,745 (87%) |
| Claude Haiku 4.5 | $4,500 | $585 | $3,915 (87%) |
| GPT-5.2 | $9,450 | $5,040 | $4,410 (47%) |
| GPT-5 mini | $1,350 | $720 | $630 (47%) |

Scenario 3: RAG application with large context

Setup: System prompt + 20 retrieved chunks = 12,000 tokens of mostly-static context per query. If you're using a fixed document set, those retrieved chunks repeat across users. User query: 200 tokens. Volume: 20,000 queries/day.

This is where caching gets interesting — and complicated. RAG context changes per user, but if many users are hitting the same documents (e.g., a product knowledge base), you can cache the document chunks and serve them to multiple users.

| Model | No Cache (monthly) | Effective Cached Cost | Monthly Savings |
| --- | --- | --- | --- |
| Claude Sonnet 4.6 | $108,000 | $14,400 | $93,600 (87%) |
| Gemini 3 Flash | $27,000 | — (no cache) | — |
| GPT-4o | $180,000 | $91,800 | $88,200 (49%) |

At this scale, prompt caching is not optional. It's the difference between sustainable unit economics and burning cash on repeated context processing.

📊 $14,400/mo (Claude Sonnet with caching) vs. $108,000/mo (Claude Sonnet without caching)

When caching actually moves the needle

Prompt caching has a disproportionate impact under specific conditions. Here's how to assess whether it's worth implementing for your use case.

High cache ratio is essential

If your static context is 500 tokens and your user message is 3,000 tokens, caching saves 14% at best. The math only becomes compelling when static context is the majority of your input.

Good candidates:

  • System prompts over 2,000 tokens
  • Fixed tool/function definitions (common in agents)
  • Shared document context in RAG pipelines
  • Few-shot examples that don't change per user
  • Long conversation history templates

Poor candidates:

  • Short system prompts (under 500 tokens)
  • Primarily user-driven context
  • Highly personalized prompts that vary per user

Call frequency amplifies savings

Caching is a fixed write cost amortized over reads. At low volume (100 calls/day), the savings are real but modest. At 10,000+ calls/day, caching becomes one of your most important infrastructure decisions.

⚠️ Warning: Anthropic's cache expires after 5 minutes of inactivity. If requests arrive further apart than the cache lifetime (e.g., one call every 15 minutes against a 5-minute cache), every call is a cold write at the 25% premium with no reads to recover it, so caching actively costs you money. The extended 1-hour cache helps with sparse traffic; factor this into your implementation.
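One way to reason about burstiness is the fraction of calls that land as cold writes. A back-of-envelope sketch using Anthropic's published multipliers:

```python
def effective_multiplier(write_fraction: float,
                         write_mult: float = 1.25,
                         read_mult: float = 0.10) -> float:
    """Blended input-cost multiplier vs. uncached pricing when
    `write_fraction` of calls are cold cache writes and the rest hit cache."""
    return write_fraction * write_mult + (1 - write_fraction) * read_mult

print(effective_multiplier(0.01))  # steady traffic: ~0.11x, i.e. ~89% cheaper
print(effective_multiplier(1.0))   # every call a cold write: 1.25x, 25% dearer
```

Estimate your write fraction from traffic logs (how often gaps exceed the cache TTL) before committing to an implementation.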

Output tokens are always full price

Caching only applies to input tokens. Output tokens are never discounted. If your output is very long relative to your input (e.g., generating 4,000-word documents from a 500-token prompt), caching saves you almost nothing.

For output-heavy workloads, focus on output length control strategies instead.


Implementing Anthropic prompt caching

Anthropic requires explicit cache markers. Here's the implementation pattern:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6-20260101",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a customer support agent for AcmeCorp...",
            "cache_control": {"type": "ephemeral"}  # Cache this block
        }
    ],
    messages=[
        {"role": "user", "content": user_message}
    ]
)

The cache_control: {"type": "ephemeral"} marker tells Anthropic to cache everything up to that point in the prompt. The first call writes the cache (25% premium). Subsequent calls within 5 minutes read from cache (90% discount).

For extended caching (up to 1 hour), use {"type": "ephemeral", "ttl": "1h"}.

What to cache vs. not cache:

  • ✅ Cache: System instructions, tool definitions, static document context
  • ✅ Cache: Multi-turn conversation history (cache the history, leave the new message uncached)
  • ❌ Don't cache: The user's current message
  • ❌ Don't cache: Dynamic context that varies per user

You can have multiple cache control markers to create layered caching. Anthropic charges separately for each cache level and applies discounts to each level independently.
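A sketch of a layered request, built as a plain dict for readability (the tool name and messages are invented; you would pass it as client.messages.create(**request)). Tool definitions precede the system prompt in the prompt order, so they form the outermost cache level:

```python
request = {
    "model": "claude-sonnet-4-6-20260101",
    "max_tokens": 1024,
    "tools": [{
        "name": "lookup_order",                   # hypothetical tool
        "description": "Look up an order by its ID.",
        "input_schema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
        "cache_control": {"type": "ephemeral"},   # cache level 1: tools
    }],
    "system": [{
        "type": "text",
        "text": "You are a customer support agent for AcmeCorp...",
        "cache_control": {"type": "ephemeral"},   # cache level 2: system prompt
    }],
    "messages": [{"role": "user", "content": "Where is order A-123?"}],
}
```

Layering pays off when the tool definitions are stable across your whole fleet while the system prompt varies per tenant: the shared level keeps caching even when the inner level misses.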


Implementing OpenAI prompt caching

OpenAI's caching is automatic — no code changes required. But you can maximize your cache hit rate with smart prompt design:

Put static content first, always. OpenAI caches from the beginning of the prompt. If your system prompt changes between calls (e.g., injecting the current date), move the date injection to the end of the system prompt or into the user message. The static portion (which comes first) will still cache.
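For example, a request builder that keeps the prefix byte-identical (names here are illustrative):

```python
import datetime

STATIC_SYSTEM_PROMPT = (   # loaded once, reused verbatim on every call
    "You are a customer support agent for AcmeCorp. "
    "Answer using the policies below..."
)

def build_messages(user_message: str) -> list[dict]:
    """Static prefix first and unchanged; the current date rides along
    in the user turn so it never breaks the cacheable prefix."""
    today = datetime.date.today().isoformat()
    return [
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},
        {"role": "user", "content": f"Today is {today}.\n\n{user_message}"},
    ]
```

Every request now shares an identical leading system message, which is exactly what the automatic prefix cache keys on.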

Avoid unnecessary variation. Don't regenerate your system prompt on every call. Load it once at startup and reuse the same string object. Even minor differences (extra whitespace, different line endings) will cause cache misses.

Check your cache hit rate. OpenAI returns cached_tokens in the usage object of every response. Monitor this in your logs — if it's consistently 0, your prompts are varying in ways you don't expect.

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # identical every call
        {"role": "user", "content": user_message}
    ]
)

# Check cache performance
cached = response.usage.prompt_tokens_details.cached_tokens
total_input = response.usage.prompt_tokens
hit_rate = cached / total_input
print(f"Cache hit rate: {hit_rate:.1%}")

✅ TL;DR: OpenAI caching is automatic — just make sure your static context is at the top of your prompt and doesn't change between calls. Anthropic caching is explicit — add cache_control markers to the blocks you want cached. Both require minimum 1,024 tokens to trigger.


Stacking caching with other optimizations

Prompt caching compounds well with other cost reduction strategies. The biggest wins come from combining caching with:

Model routing: Cache your system prompt on a capable mid-tier model (Claude Sonnet, GPT-4.1) and route simple queries there. Use a cheap model (Haiku, GPT-5 nano) for genuinely simple tasks that don't need cached context. See our guide on 10 strategies to cut your AI API bill for the full routing breakdown.

Batch API: If your use case allows async processing, OpenAI's Batch API gives 50% off. Combined with prompt caching (which also applies to batch), you can achieve 75% total input cost reduction on compatible workloads.

Context compression: Before caching, make sure your static context is as tight as possible. A 6,000-token system prompt that could be 3,000 tokens without loss of quality will cache for less and consume fewer tokens on non-cached calls. Compress first, then cache.

Shorter conversations: For multi-turn chatbots, cache early conversation turns and summarize older turns instead of maintaining infinite history. This keeps the total context manageable while keeping caching effective. Our RAG application cost guide covers context management in detail.


Choosing between OpenAI and Anthropic for cacheable workloads

The 90% vs 50% caching discount is real, but it's not the whole story. Here's how to decide:

Choose Anthropic caching when:

  • You have control over the API integration (can add cache_control markers)
  • Your static context is large (5,000+ tokens)
  • Traffic is consistent enough to maintain warm caches
  • You're already using Claude for quality reasons

Choose OpenAI caching when:

  • You want zero implementation overhead (it's automatic)
  • You're using GPT-4o mini or GPT-4.1 mini for cost-sensitive workloads
  • Your system prompt is 1,000-2,000 tokens (OpenAI's simpler implementation is sufficient)
  • You're not sure if caching applies and want a safe default

For most developers starting out, OpenAI's automatic caching is the right call — it requires no code changes and immediately applies to any prompt over 1,024 tokens. If you're at significant scale (500K+ tokens/day in static context), the jump to Anthropic's 90% discount justifies the implementation overhead.

💡 Key Takeaway: At 1M cached input tokens per day on Claude Sonnet ($3.00/M), a 90% discount saves $2.70/day while a 50% discount would save $1.50/day: a difference of $1.20/day, or roughly $438/year per million daily cached tokens. Worth it at scale, not worth the complexity at low volume.


Use the calculator to model your savings

The numbers in this guide use specific assumptions about prompt structure and traffic patterns. Your real savings depend on:

  • Your actual cache-eligible token ratio
  • Your daily request volume
  • Whether traffic is continuous (warm cache) or bursty (repeated cold starts)

Use the AI Cost Check calculator to plug in your specific token volumes and compare cached vs. uncached costs across providers. You can model both the before and after scenarios and see the exact dollar impact.

For deeper context on why tokens and pricing work the way they do, read our guide on what AI tokens are and how pricing works.


Frequently asked questions

Does prompt caching affect response quality?

No. The model sees the same tokens — they're just retrieved from cache instead of reprocessed. Response quality, capability, and behavior are identical whether or not a cache hit occurs.

What's the minimum prompt size to benefit from caching?

OpenAI requires at least 1,024 tokens for caching to trigger. Anthropic's minimum is 1,024 tokens on most models (2,048 on Haiku models). Below these thresholds, every request is processed fresh. The practical savings threshold is higher — you need enough cached tokens to meaningfully reduce your total input cost.

How do I know if caching is working?

OpenAI returns prompt_tokens_details.cached_tokens in every response's usage object. A value of 0 means no cache hit. Anthropic returns cache_read_input_tokens and cache_creation_input_tokens in the usage object. Monitor these in your application logs to confirm cache hit rates.
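A small helper for the Anthropic side (assuming, per Anthropic's usage reporting, that input_tokens counts only uncached tokens, with cache reads and writes in separate fields):

```python
def anthropic_cache_hit_rate(input_tokens: int,
                             cache_read_input_tokens: int,
                             cache_creation_input_tokens: int) -> float:
    """Fraction of this request's input tokens that were served from cache."""
    total = input_tokens + cache_read_input_tokens + cache_creation_input_tokens
    return cache_read_input_tokens / total if total else 0.0

# e.g. 150 fresh user tokens against a 6,000-token cached system prompt
print(f"{anthropic_cache_hit_rate(150, 6_000, 0):.1%}")  # 97.6%
```

Feed it the three fields from response.usage and alert if the rate drops unexpectedly.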

Can I cache conversation history?

Yes, with careful implementation. For Anthropic, you can cache the conversation history up to the current turn by placing a cache_control marker at the end of the history block. This means older turns are served from cache while the new user message is processed fresh. For OpenAI, the system prompt and earlier messages are auto-cached if they appear consistently at the top of the request.
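Sketching the Anthropic pattern (contents invented for illustration): the marker goes on the final content block of the accumulated history, so everything before the new turn is the cacheable prefix.

```python
history = [
    {"role": "user", "content": "What plans do you offer?"},
    {"role": "assistant", "content": [{
        "type": "text",
        "text": "We offer Basic, Pro, and Enterprise plans.",
        "cache_control": {"type": "ephemeral"},  # cacheable prefix ends here
    }]},
]

# The new turn sits after the cache breakpoint and is processed fresh.
messages = history + [{"role": "user", "content": "How much is Pro?"}]
# client.messages.create(model=..., system=..., messages=messages, ...)
```

On each turn you move the marker to the new end of history, so the previous prefix stays a cache hit and only the latest exchange is written fresh.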

Does caching work with streaming?

Yes. Both OpenAI and Anthropic support prompt caching with streaming responses. The cache applies to the input token processing — the output streaming behavior is unaffected.

What happens when the cache expires?

Anthropic caches expire after 5 minutes of inactivity (1 hour with extended caching). OpenAI's cache duration is typically 5-10 minutes. After expiration, the next request writes a new cache entry at normal or write-premium pricing, then subsequent requests hit the refreshed cache.

