March 9, 2026

AI Model Routing: How to Cut API Costs 70% by Using the Right Model for Each Task

Stop sending every request to your most expensive model. AI model routing matches each task to the cheapest model that can handle it — saving 50-80% on API costs without sacrificing quality. Full implementation guide with real pricing math.

Tags: cost-optimization, model-routing, finops, engineering, 2026

Most teams pick one AI model and send everything to it. Customer support tickets, code reviews, document summaries, creative writing — all routed to the same endpoint. That's like hiring a senior engineer to answer the phone.

AI model routing fixes this. Instead of one model for everything, you classify each request and send it to the cheapest model that can handle it well. Simple classification tasks go to GPT-5 nano at $0.05/$0.40 per million tokens. Complex reasoning goes to Claude Opus 4.6 at $5/$25 per million tokens. Everything in between gets matched to the right tier using the same logic behind cheapest AI API comparisons.

The result? Most teams cut their API bill by 50-80% without any measurable drop in output quality.

[stat] 70% Average cost reduction when teams implement three-tier model routing instead of using a single flagship model

This guide covers exactly how to implement model routing — from the simple version you can ship today to the sophisticated version that optimizes automatically.


Why single-model architectures waste money

Here's the uncomfortable truth about most AI deployments: 80% of requests don't need a flagship model.

Think about the actual requests hitting your API. A customer asks "what are your business hours?" — that's a lookup, not a reasoning challenge. Someone submits a form and you need to extract three fields — that's structured extraction, not creative writing. A user asks for a one-paragraph summary of an email — even the cheapest models nail this.

Yet most teams send all of these to GPT-5 or Claude Sonnet because it's simpler to maintain one integration. Simple, yes. But expensive.

Let's put real numbers on it. Consider a SaaS app handling 1 million requests per month with an average of 500 input tokens and 300 output tokens per request:

| Model | Input cost | Output cost | Monthly total |
| --- | --- | --- | --- |
| Claude Opus 4.6 | $2,500 | $7,500 | $10,000 |
| Claude Sonnet 4.6 | $1,500 | $4,500 | $6,000 |
| GPT-5.4 | $1,250 | $4,500 | $5,750 |
| GPT-5 mini | $125 | $600 | $725 |
| GPT-5 nano | $25 | $120 | $145 |
| Gemini 2.0 Flash | $50 | $120 | $170 |
| DeepSeek V3.2 | $140 | $126 | $266 |

The spread between the cheapest and most expensive option is 69×. Even moving from a flagship to a mid-tier model cuts your bill in half. Routing across multiple tiers? That's where the real savings live.
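The totals in the table fall out of one formula: tokens in millions times price per million, summed for input and output. A quick sketch (prices are from the table above; the helper name is ours):

```python
PRICES = {  # (input $/M tokens, output $/M tokens), from the table above
    "claude-opus-4.6": (5.00, 25.00),
    "gpt-5-nano": (0.05, 0.40),
    "deepseek-v3.2": (0.28, 0.42),
}

def monthly_cost(model, requests, in_tokens, out_tokens):
    """Monthly API spend in dollars for a given traffic profile."""
    in_price, out_price = PRICES[model]
    return (requests * in_tokens / 1e6) * in_price \
         + (requests * out_tokens / 1e6) * out_price

# 1M requests/month at 500 input + 300 output tokens each:
print(round(monthly_cost("claude-opus-4.6", 1_000_000, 500, 300), 2))  # 10000.0
print(round(monthly_cost("gpt-5-nano", 1_000_000, 500, 300), 2))       # 145.0
```

Run your own traffic numbers through this before committing to a routing design; the spread between tiers is usually larger than expected.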

💡 Key Takeaway: You don't need to sacrifice quality to save money. You need to stop over-provisioning intelligence for simple tasks.


The three-tier routing model

The simplest effective routing strategy uses three tiers. Each tier maps to a class of tasks with different complexity requirements.

Tier 1: Nano/Flash — Simple tasks ($0.05-$0.30 per million input tokens)

Best models: GPT-5 nano ($0.05/$0.40), Gemini 2.0 Flash-Lite ($0.075/$0.30), Mistral Small 3.2 ($0.06/$0.18), GPT-4.1 nano ($0.10/$0.40)

Use for:

  • Text classification and sentiment analysis
  • Entity extraction from structured inputs
  • Simple Q&A with provided context
  • Language detection
  • Content moderation and filtering
  • Formatting and template filling
  • Short summaries (under 100 words)

These models handle 40-60% of typical production traffic. They're fast (under 200ms for most requests), dirt cheap, and reliably accurate on well-defined tasks. Mistral Small 3.2 at $0.06/$0.18 is particularly impressive — it costs less than a tenth of a cent per typical request.

Tier 2: Mid-range — Moderate tasks ($0.25-$3.00 per million input tokens)

Best models: GPT-5 mini ($0.25/$2.00), Gemini 2.5 Flash ($0.30/$2.50), Claude Haiku 4.5 ($1.00/$5.00), DeepSeek V3.2 ($0.28/$0.42)

Use for:

  • Multi-paragraph summaries and analysis
  • Conversational AI with personality
  • Code generation for standard patterns
  • RAG-powered answers requiring synthesis
  • Email drafting and content writing
  • Data transformation and parsing
  • Multi-step reasoning with straightforward logic

This tier handles 30-40% of traffic. These models punch well above their price point. DeepSeek V3.2 is the standout value here — flagship-competitive quality at $0.28/$0.42 per million tokens, which is cheaper than most "budget" options from other providers, as we show in DeepSeek vs GPT-5 Mini.

$0.28 per M input (DeepSeek V3.2) vs $3.00 per M input (Claude Sonnet 4.6)

Tier 3: Flagship/Reasoning — Complex tasks ($1.25-$15.00 per million input tokens)

Best models: GPT-5.4 ($2.50/$15.00), Claude Sonnet 4.6 ($3.00/$15.00), Claude Opus 4.6 ($5.00/$25.00), Gemini 3.1 Pro ($2.00/$12.00), Grok 4 ($3.00/$15.00)

Use for:

  • Complex multi-step reasoning
  • Creative writing requiring nuance
  • Code review and architecture decisions
  • Legal or medical analysis
  • Long-form content generation
  • Tasks requiring deep domain knowledge
  • Agentic workflows with tool use

This tier handles only 10-20% of traffic — but these are the requests where quality matters most. GPT-5.4 and Claude Sonnet 4.6 offer the best balance of capability and cost in this tier. Reserve Claude Opus 4.6 and GPT-5.4 Pro for genuinely difficult problems where the extra quality justifies 2-10× the cost.

📊 Quick Math: If 50% of your traffic goes to Tier 1, 35% to Tier 2, and 15% to Tier 3, your blended cost is roughly $0.50-$1.50 per million input tokens — compared to $3-$5 for a single flagship model. That's a 60-85% reduction, and you can validate the mix with a tokens-per-dollar benchmark.
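That blended figure is just a traffic-weighted average of per-tier prices. A sketch, using illustrative input prices from inside the tier ranges above:

```python
def blended_price(mix):
    """mix: (traffic_share, $ per M input tokens) pairs; shares sum to 1."""
    assert abs(sum(share for share, _ in mix) - 1.0) < 1e-9
    return sum(share * price for share, price in mix)

# 50% Tier 1 ($0.10), 35% Tier 2 ($0.30), 15% Tier 3 ($3.00):
mix = [(0.50, 0.10), (0.35, 0.30), (0.15, 3.00)]
print(round(blended_price(mix), 3))  # 0.605
```

Swap in your own traffic shares and prices to see where your blend lands in the $0.50-$1.50 range.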


How to classify requests for routing

The routing layer needs to decide which tier handles each request. There are four approaches, from simple to sophisticated.

Approach 1: Static routing by endpoint

The simplest method. Map each API endpoint or feature to a fixed tier:

/api/classify-ticket     → Tier 1 (GPT-5 nano)
/api/summarize-email     → Tier 1 (Gemini 2.0 Flash-Lite)
/api/chat                → Tier 2 (GPT-5 mini)
/api/generate-report     → Tier 3 (Claude Sonnet 4.6)
/api/code-review         → Tier 3 (GPT-5.4)

Pros: Zero latency overhead, dead simple to implement, easy to reason about costs. Cons: No flexibility within endpoints. A simple chat question gets the same model as a complex one.

This approach works well for applications where each endpoint has a consistent complexity level. Most teams should start here and add sophistication later.
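A static router is little more than a lookup table with a default. A minimal sketch (the endpoint names mirror the example above; the model IDs are illustrative):

```python
ROUTES = {
    "/api/classify-ticket": "gpt-5-nano",
    "/api/summarize-email": "gemini-2.0-flash-lite",
    "/api/chat": "gpt-5-mini",
    "/api/generate-report": "claude-sonnet-4.6",
    "/api/code-review": "gpt-5.4",
}
DEFAULT_MODEL = "gpt-5-mini"  # unknown endpoints land on the mid-tier

def pick_model(endpoint: str) -> str:
    return ROUTES.get(endpoint, DEFAULT_MODEL)

print(pick_model("/api/chat"))         # gpt-5-mini
print(pick_model("/api/new-feature"))  # gpt-5-mini (default)
```

Keeping this as a config file rather than hardcoded strings makes tier changes a deploy-free edit.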

Approach 2: Keyword and heuristic routing

Add simple rules based on the request content:

  • Short inputs (under 50 tokens) → Tier 1
  • Requests mentioning "analyze," "compare," "explain why" → Tier 3
  • Requests with code blocks → Tier 2 or 3 based on length
  • Translation or formatting requests → Tier 1

Pros: Catches obvious cases without an LLM call. No added latency. Cons: Brittle. Users who phrase things unusually get misrouted.
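Those rules translate into a few lines of code. A sketch, with thresholds that are illustrative rather than tuned:

```python
import re

def heuristic_tier(prompt: str) -> int:
    """Pick a tier from surface features; thresholds are illustrative."""
    words = len(prompt.split())        # crude stand-in for a token count
    if re.search(r"\b(analyze|compare|explain why)\b", prompt, re.I):
        return 3                       # reasoning-flavored verbs
    if "```" in prompt:                # contains a code block
        return 3 if words > 300 else 2
    if words < 50:
        return 1                       # short inputs go to the nano tier
    return 2

print(heuristic_tier("What are your business hours?"))           # 1
print(heuristic_tier("Please analyze these quarterly numbers"))  # 3
```

Rule order matters: check the expensive signals (reasoning keywords) before the cheap ones (length), or complex short requests get misrouted down.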

Approach 3: LLM-based classification

Use a nano-tier model to classify the request complexity before routing:

System: Classify this user request as SIMPLE, MODERATE, or COMPLEX.
SIMPLE = lookup, classification, short extraction, formatting
MODERATE = summarization, standard generation, conversational
COMPLEX = multi-step reasoning, analysis, creative, code architecture

Respond with only the classification word.

A GPT-5 nano call for this classification costs roughly $0.00002 per request (the short system prompt plus the user's request, call it 400 input tokens at $0.05/M, plus 1 output token). At a million requests per month, that's about $20 — negligible compared to the thousands saved by routing correctly.

Pros: Adapts to any request type. High accuracy. Cons: Adds 100-200ms latency per request. Occasional misclassification.
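Here's what that classifier looks like against an OpenAI-style chat completions API. The model name ("gpt-5-nano") and the exact SDK surface are assumptions; swap in whatever client and nano-tier model you actually use:

```python
CLASSIFIER_PROMPT = (
    "Classify this user request as SIMPLE, MODERATE, or COMPLEX.\n"
    "SIMPLE = lookup, classification, short extraction, formatting\n"
    "MODERATE = summarization, standard generation, conversational\n"
    "COMPLEX = multi-step reasoning, analysis, creative, code architecture\n"
    "Respond with only the classification word."
)

TIER_FOR = {"SIMPLE": 1, "MODERATE": 2, "COMPLEX": 3}

def tier_from_label(label: str) -> int:
    """Map the classifier's one-word answer to a tier; unknown answers
    fall back to the mid-tier rather than risking a bad cheap route."""
    return TIER_FOR.get(label.strip().upper(), 2)

def classify(request_text: str) -> int:
    from openai import OpenAI          # deferred: only needed at call time
    client = OpenAI()                  # reads OPENAI_API_KEY from the env
    resp = client.chat.completions.create(
        model="gpt-5-nano",            # assumed nano-tier model name
        messages=[
            {"role": "system", "content": CLASSIFIER_PROMPT},
            {"role": "user", "content": request_text},
        ],
        max_tokens=2,                  # one word is all we want back
    )
    return tier_from_label(resp.choices[0].message.content)

print(tier_from_label("complex"))      # 3
```

Note the defensive parse: if the model returns anything unexpected, the request defaults to the mid-tier instead of failing.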

Approach 4: ML classifier with LLM fallback

Train a lightweight classifier (logistic regression or small transformer) on labeled routing decisions. Use the LLM classifier as fallback for low-confidence predictions.

Pros: Near-zero latency, highest accuracy over time, no per-request LLM cost. Cons: Requires labeled data and ML infrastructure. Overkill for most teams.

✅ TL;DR: Start with static routing (Approach 1). Add heuristic rules as you identify patterns. Graduate to LLM classification when your volume justifies the engineering investment.


Real-world routing examples with cost math

Let's work through three realistic scenarios to show exactly how much routing saves.

Scenario 1: Customer support chatbot

A SaaS company handles 500,000 chat messages per month. Average: 200 input tokens, 400 output tokens per message.

Without routing (all Claude Sonnet 4.6):

  • Input: 100M tokens × $3/M = $300
  • Output: 200M tokens × $15/M = $3,000
  • Monthly total: $3,300

With three-tier routing:

| Category | % of traffic | Model | Input cost | Output cost |
| --- | --- | --- | --- | --- |
| FAQ/simple lookup | 45% | Mistral Small 3.2 | $2.70 | $16.20 |
| Standard support | 40% | GPT-5 mini | $10 | $160 |
| Escalated/complex | 15% | Claude Sonnet 4.6 | $45 | $450 |

Monthly total: $684

[stat] $2,616/month Saved by routing a customer support chatbot across three tiers instead of using Claude Sonnet for everything

That's a 79% cost reduction. The FAQ answers are just as good from Mistral Small — probably faster, too. The complex escalations still get Claude Sonnet's full capability. Customers notice no difference.

Scenario 2: Code review pipeline

A development team runs 50,000 code reviews per month. Average: 2,000 input tokens (code diff), 800 output tokens (review comments).

Without routing (all GPT-5.4):

  • Input: 100M tokens × $2.5/M = $250
  • Output: 40M tokens × $15/M = $600
  • Monthly total: $850

With routing:

| Category | % of traffic | Model | Input cost | Output cost |
| --- | --- | --- | --- | --- |
| Style/lint checks | 35% | GPT-4.1 nano | $3.50 | $5.60 |
| Standard reviews | 45% | DeepSeek V3.2 | $12.60 | $7.56 |
| Architecture reviews | 20% | GPT-5.4 | $50 | $120 |

Monthly total: $199

Savings: $651/month (77%).

Style and lint checks are pattern matching — a nano model handles them perfectly. Standard code reviews (variable naming, error handling, common patterns) work great on DeepSeek V3.2. Only architecture-level reviews need the full power of GPT-5.4.

Scenario 3: Content generation platform

A marketing platform generates 200,000 pieces of content per month. Average: 500 input tokens, 1,500 output tokens.

Without routing (all Claude Sonnet 4.6):

  • Input: 100M tokens × $3/M = $300
  • Output: 300M tokens × $15/M = $4,500
  • Monthly total: $4,800

With routing:

| Category | % of traffic | Model | Input cost | Output cost |
| --- | --- | --- | --- | --- |
| Social media captions | 30% | Gemini 2.5 Flash | $9 | $225 |
| Email/blog drafts | 45% | GPT-5 mini | $11.25 | $270 |
| Premium long-form | 25% | Claude Sonnet 4.6 | $75 | $1,125 |

Monthly total: $1,715

Savings: $3,085/month (64%).

Social media captions don't need a flagship model. They need to be on-brand and grammatically correct — Gemini 2.5 Flash handles that at a fraction of the cost. The premium long-form content still gets Claude's nuanced writing.

⚠️ Warning: Don't route safety-critical tasks (medical advice, legal analysis, financial recommendations) to budget models just to save money. The cost of a wrong answer far exceeds the API savings.


Building your routing layer: implementation guide

Here's a practical architecture for a routing layer you can build in a day.

The router component

Your router sits between your application and the LLM providers. It needs three things:

  1. A routing table mapping task types to models
  2. A classifier that determines the task type
  3. A provider abstraction that normalizes API calls across OpenAI, Anthropic, Google, etc.

Most LLM proxy tools (LiteLLM, Portkey, Martian) already provide the provider abstraction. You add the routing logic on top, then update model prices from your canonical AI API pricing guide.

Routing table design

Keep it simple. A JSON config that maps categories to models:

{
  "routes": {
    "classification": { "model": "gpt-5-nano", "maxTokens": 100 },
    "extraction": { "model": "mistral-small-3.2", "maxTokens": 500 },
    "summarization": { "model": "gpt-5-mini", "maxTokens": 1000 },
    "conversation": { "model": "deepseek-v3.2", "maxTokens": 2000 },
    "analysis": { "model": "claude-sonnet-4.6", "maxTokens": 4000 },
    "creative": { "model": "gpt-5.4", "maxTokens": 4000 },
    "reasoning": { "model": "claude-opus-4.6", "maxTokens": 8000 }
  },
  "default": { "model": "gpt-5-mini", "maxTokens": 2000 }
}

Fallback strategy

Always have a fallback plan:

  1. If the routed model fails → retry with the next tier up
  2. If classification confidence is low → default to mid-tier
  3. If the response quality is poor (detected by output validation) → re-send to a higher tier

The fallback adds cost for individual requests but saves money overall by keeping the baseline tier low.
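The escalation ladder can be sketched as a loop over tiers, where `call_model` stands in for your provider abstraction and the model names are illustrative:

```python
TIERS = ["gpt-5-nano", "gpt-5-mini", "claude-sonnet-4.6"]  # cheap -> expensive

def call_with_fallback(call_model, prompt, start_tier=0):
    """Try the routed tier first; on failure, retry one tier up."""
    last_err = None
    for model in TIERS[start_tier:]:
        try:
            return call_model(model, prompt)
        except Exception as err:       # timeout, rate limit, validation failure
            last_err = err             # escalate to the next tier
    raise RuntimeError("all tiers failed") from last_err

# Toy provider that fails on the nano tier, to show the escalation:
def fake_call(model, prompt):
    if model == "gpt-5-nano":
        raise TimeoutError("nano timed out")
    return f"{model}: ok"

print(call_with_fallback(fake_call, "summarize this thread"))  # gpt-5-mini: ok
```

In production you'd distinguish retryable errors (timeouts, rate limits) from permanent ones (bad request) rather than catching everything.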

Monitoring and optimization

Track three metrics per route:

  • Cost per request — Are you actually saving money?
  • Latency — Are budget models fast enough?
  • Quality score — Are users satisfied with the outputs? (Use thumbs up/down, automated eval, or spot checks)

Review these weekly. If a budget model is getting poor quality scores on a route, bump it up one tier. If a flagship model is handling trivially simple requests, bump it down.
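A minimal way to track those three metrics per route, stdlib only (the metric names and thumbs-up quality signal are one possible design, not a prescription):

```python
from collections import defaultdict

class RouteStats:
    """Rolling per-route totals for cost, latency, and quality feedback."""
    def __init__(self):
        self._d = defaultdict(lambda: {"cost": 0.0, "latency": 0.0,
                                       "good": 0, "total": 0})

    def record(self, route, cost, latency_ms, thumbs_up):
        d = self._d[route]
        d["cost"] += cost
        d["latency"] += latency_ms
        d["good"] += int(thumbs_up)
        d["total"] += 1

    def report(self, route):
        d = self._d[route]
        n = d["total"] or 1            # avoid div-by-zero on empty routes
        return {"cost_per_req": d["cost"] / n,
                "avg_latency_ms": d["latency"] / n,
                "quality": d["good"] / n}

stats = RouteStats()
stats.record("summarization", cost=0.0004, latency_ms=180, thumbs_up=True)
stats.record("summarization", cost=0.0006, latency_ms=220, thumbs_up=False)
r = stats.report("summarization")
print(r["quality"], r["avg_latency_ms"])   # 0.5 200.0
```

In practice you'd emit these as metrics to your observability stack instead of keeping them in process memory, but the per-route aggregation logic is the same.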

💡 Key Takeaway: The best routing configuration is one you iterate on. Start conservative (route less traffic to budget models), then gradually shift traffic down as you validate quality.


Provider-specific routing considerations

Each provider has unique features that affect routing decisions.

OpenAI

The GPT-5 family gives you the most granular tier options: nano ($0.05/$0.40), mini ($0.25/$2.00), standard ($1.25/$10.00), and pro ($15/$120). That four-tier spread within a single provider simplifies integration — one API key, one SDK, four price points.

OpenAI also offers a Batch API with a 50% discount for non-time-sensitive workloads. Combine routing with batching for maximum savings: route simple tasks to nano, batch the non-urgent ones, and your effective input cost drops to $0.025 per million tokens.

Anthropic

Claude's strength is the massive quality jump between Haiku 4.5 ($1/$5) and Sonnet 4.6 ($3/$15). Haiku handles most mid-tier tasks admirably, and Sonnet is one of the strongest models at its price point. Prompt caching gives you 90% off cached reads — if your routing layer reuses system prompts across requests (which it should), you're effectively paying $0.10 per million cached input tokens on Haiku.

Reserve Opus 4.6 ($5/$25) for tasks that genuinely need it. It's only 67% more expensive than Sonnet on input but the quality gap is narrower than the price gap suggests for most tasks.

Google

Gemini's pricing is aggressive. Gemini 2.0 Flash at $0.10/$0.40 rivals nano-tier pricing with mid-tier quality. Gemini 2.5 Flash ($0.30/$2.50) with thinking capabilities competes with models twice its price. And Gemini 2.0 Flash-Lite at $0.075/$0.30 is the cheapest option from any major provider.

The catch: Google's API occasionally has higher latency variance than OpenAI or Anthropic. Factor this into latency-sensitive routing.

DeepSeek and open-weight models

DeepSeek V3.2 at $0.28/$0.42 is arguably the best value in AI right now. Output tokens at $0.42 per million is cheaper than most providers' input pricing. If your routing layer can tolerate slightly higher latency (DeepSeek's API is hosted in China with global CDN), it's an incredible mid-tier option.

For self-hosted deployments, Llama 4 Maverick and Mistral models via local inference can reduce per-token costs to near zero after hardware amortization — though you trade operational simplicity for cost savings.


Common routing mistakes to avoid

Mistake 1: Routing by model name instead of capability

Don't think "send hard stuff to GPT-5.4 and easy stuff to GPT-5 nano." Think "send classification tasks to Tier 1 and reasoning tasks to Tier 3." Model names change. Capability tiers don't.

When a new model launches, you should be able to slot it into the right tier without rewriting your routing logic.

Mistake 2: Over-optimizing too early

Don't build a sophisticated ML-based router before you have 10,000 requests. Start with static routing, measure for a month, then optimize. The first 80% of savings come from the simplest routing rules.

Mistake 3: Ignoring output token costs

Input tokens get all the attention, but output tokens are 2-8× more expensive on most models. A request with 100 input tokens and 2,000 output tokens is dominated by output cost. Route based on expected output length, not just input complexity.

For example, a "write a 2,000-word blog post" request should go to a model with competitive output pricing. DeepSeek V3.2 at $0.42/M output tokens versus Claude Sonnet 4.6 at $15/M output tokens is a 35× difference on the output side.
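To see the output dominance concretely, price that blog-post request under both models (assuming roughly 2,700 output tokens for 2,000 words; the token-per-word ratio is a rough estimate, the prices are from earlier in the article):

```python
def request_cost(in_tok, out_tok, in_price, out_price):
    """Dollar cost of one request given per-million-token prices."""
    return (in_tok * in_price + out_tok * out_price) / 1e6

sonnet = request_cost(100, 2700, 3.00, 15.00)   # Claude Sonnet 4.6
deepseek = request_cost(100, 2700, 0.28, 0.42)  # DeepSeek V3.2
print(f"${sonnet:.4f} vs ${deepseek:.4f}")      # $0.0408 vs $0.0012
```

On this request, input tokens contribute under 1% of the Sonnet cost; output pricing decides everything.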

Mistake 4: Not testing quality per tier

"It works fine on GPT-5.4" doesn't mean it works fine on GPT-5 nano. Build an evaluation suite for each route. Run the same 100 test cases through each candidate model. Measure accuracy, relevance, and format compliance. Only route to a cheaper model when its quality scores are within acceptable bounds.
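A per-route eval harness can be this small. `fake_call` and the exact-match grading are toy stand-ins; in practice you'd call real candidate models and grade with regexes, schema validation, or an LLM judge:

```python
def eval_route(call_model, test_cases, candidates):
    """Run every (input, expected) pair through each candidate model
    and return a pass rate per model."""
    scores = {}
    for model in candidates:
        passed = sum(call_model(model, q) == expected
                     for q, expected in test_cases)
        scores[model] = round(passed / len(test_cases), 2)
    return scores

# Toy stand-ins: "flagship" always answers correctly, "nano" always says "4".
ANSWERS = {"2+2": "4", "3+3": "6", "10+1": "11"}

def fake_call(model, question):
    return ANSWERS[question] if model == "flagship" else "4"

cases = [(q, a) for q, a in ANSWERS.items()]
print(eval_route(fake_call, cases, ["flagship", "nano"]))
# {'flagship': 1.0, 'nano': 0.33}
```

Rerun this harness whenever a provider ships a model update; cheap-tier quality moves fast enough that last quarter's routing decisions go stale.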

✅ TL;DR: Start simple, measure everything, upgrade complexity only when data justifies it. The biggest savings come from the first routing rule, not the last optimization.


Tools for implementing model routing

You don't need to build everything from scratch. Several tools handle the infrastructure:

  • LiteLLM — Open-source proxy that normalizes 100+ LLM APIs. Add routing rules on top of its unified interface. Free.
  • Portkey — AI gateway with built-in routing, fallbacks, and cost tracking. Has a routing engine that can load-balance across models.
  • Martian — Specifically designed for model routing. Uses a meta-model to pick the best model for each request.
  • OpenRouter — API that provides access to hundreds of models through a single endpoint, making it easy to switch between tiers.
  • Custom proxy — A lightweight Express/FastAPI server with a routing config. Under 200 lines of code for basic static routing.

For cost tracking across routed models, use our AI Cost Calculator to compare per-request costs and find the cheapest model for each tier.


Frequently asked questions

How much does AI model routing actually save?

Most teams see 50-80% cost reduction compared to single-model architectures. The exact savings depend on your traffic mix — applications with high volumes of simple requests (customer support, classification, extraction) save the most. A typical SaaS app routing 50% of traffic to nano/flash models, 35% to mid-tier, and 15% to flagships pays roughly $0.50-$1.50 per million input tokens blended, compared to $3-$5 for a single flagship.

Does model routing add latency?

Static routing (mapping endpoints to models) adds zero latency. LLM-based classification adds 100-200ms per request. For most applications, this is negligible — the LLM generation itself takes 500-3000ms. If latency is critical, use heuristic routing or a pre-trained classifier instead of an LLM classifier.

Which models work best for each routing tier?

For Tier 1 (simple tasks): Mistral Small 3.2 ($0.06/$0.18) and GPT-5 nano ($0.05/$0.40) offer the best value. For Tier 2 (moderate tasks): DeepSeek V3.2 ($0.28/$0.42) is unbeatable on price-to-quality ratio. For Tier 3 (complex tasks): GPT-5.4 ($2.50/$15) and Claude Sonnet 4.6 ($3/$15) balance capability with reasonable pricing. Check our model comparison tools for the latest pricing.

Can I use model routing with streaming responses?

Yes. Most LLM proxy tools (LiteLLM, Portkey) support streaming across all providers. Your routing layer makes the model decision before the request is sent, so streaming works identically regardless of which model handles it. The key is ensuring your provider abstraction normalizes streaming formats across OpenAI, Anthropic, and Google APIs.

How do I know if a cheaper model is "good enough" for a task?

Build a test suite of 50-100 representative inputs for each route. Run them through your candidate models and grade the outputs on accuracy, relevance, format compliance, and tone. If the cheaper model scores within 5-10% of the expensive one, it's good enough. Review weekly by sampling production outputs. Most teams are surprised by how capable budget models are on well-defined tasks — the quality gap is much smaller than the price gap.


Start routing today

You don't need a perfect system on day one. Here's your action plan:

  1. Audit your current usage. What percentage of your requests are simple classification, extraction, or formatting? Those are your immediate routing candidates.
  2. Pick two tiers. Start with just "simple" and "everything else." Route simple tasks to GPT-5 nano or Mistral Small 3.2. Keep everything else on your current model.
  3. Measure for two weeks. Track cost per request, quality scores, and user feedback for both tiers.
  4. Add a mid-tier. Split "everything else" into moderate (GPT-5 mini or DeepSeek V3.2) and complex (keep your flagship).
  5. Iterate. Review routing decisions weekly. Adjust thresholds based on data.

Use the AI Cost Calculator to model your costs across different routing configurations before you implement. Plug in your actual traffic volumes and see exactly how much each routing strategy saves.

The models are getting cheaper every quarter. The question isn't whether to implement routing — it's how much money you're leaving on the table by not doing it today.