Why API Routing Fails at Scale: The Compute Arbitrage Tradeoff
Hitesh Sondhi · May 6, 2026 · 12 min read
We’ve seen teams celebrate a 28% inference cost drop on Thursday and get paged for a customer-facing outage by Monday.
That’s the dirty little secret of multi-provider AI routing: the spreadsheet looks brilliant right up until production starts acting like a budget airline during a thunderstorm. Cheap seats, delayed arrivals, missing luggage. Everybody’s angry, and somehow it’s still “within policy.”
If compute arbitrage through API routing seems attractive, you’re not wrong. The idea is seductive: send each request to the cheapest or fastest model endpoint, exploit pricing gaps, dodge outages, and print margin. In theory, it’s smart. In practice, once traffic gets real, the tradeoff between cost, latency, and reliability stops being a neat optimization problem and starts behaving like a knife fight.
Key Takeaways
- API routing works well in demos and often gets messy in production because provider behavior changes faster than your routing rules.
- The cheapest token price is often a lie once you factor in retries, timeouts, prompt bloat, and failed tool calls.
- Latency variance matters more than average latency for user-facing AI systems, especially voice and agent workflows.
- Reliability isn’t just uptime. It’s schema stability, rate-limit behavior, output consistency, and how ugly failure recovery gets.
- If you want real compute arbitrage, you need routing, observability, fallback logic, and often a custom or on-device path—not just more vendors.
Why “Just Route to the Cheapest Model” Is Usually Bad Advice
The first mistake is treating AI APIs like commodity electricity. They’re not. They’re more like airline tickets sold by carriers that keep changing the route, the baggage rules, and whether the pilot speaks your language.
Two providers can advertise similar token pricing and still produce wildly different total cost per successful task. One model may need 30% more output tokens. Another may require longer prompts to stay on-format. A third may be cheap until you hit retries because its structured output fails under load.
That’s where compute arbitrage through API routing gets misunderstood. People think arbitrage means “buy low, sell high.” In AI systems, it usually means “buy low, then pay hidden fees in latency, instability, and engineering time.”
And engineering time is not free. It’s the line item everyone pretends doesn’t exist because it ruins the deck.
The Hidden Math Behind API Credit Systems
A lot of providers still make pricing feel simpler than it is. You get credits, token bundles, throughput tiers, or some vague “optimized” rate card. Nice marketing. Terrible operational clarity.
For LLM APIs, compute usually dominates cost, but not always. If your pipeline does retrieval, stores embeddings, ships large context windows, or moves audio around, bandwidth and storage can become meaningful cost drivers too. AWS has long documented that data transfer pricing can materially affect system cost depending on architecture and traffic patterns (AWS Data Transfer Pricing), and storage and retrieval patterns can matter in services like S3 (Amazon S3 Pricing).
Here’s the practical math we use when evaluating routing decisions:
Effective cost per successful task =
(request cost + response cost + retrieval cost + tool-call cost + retry cost + timeout waste + orchestration overhead) / success rate
That denominator is where dashboards go to die.
If Provider A costs $0.002 per request and succeeds 97% of the time, while Provider B costs $0.0016 but succeeds 82% of the time and needs more retries, Provider B may be more expensive in reality. This isn’t theory. We’ve watched “cost-optimized” routes quietly become the most expensive path in the system because they generated failure loops in downstream services.
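To make the arithmetic concrete, here is a minimal Python sketch of the formula above. The `RouteCosts` fields mirror the terms in the numerator; the class name, the function name, and the $0.0002 retry overhead for Provider B are assumptions for illustration, not measured values.

```python
from dataclasses import dataclass


@dataclass
class RouteCosts:
    """Per-attempt cost components in dollars (all values hypothetical)."""
    request: float = 0.0
    response: float = 0.0
    retrieval: float = 0.0
    tool_calls: float = 0.0
    retries: float = 0.0
    timeout_waste: float = 0.0
    orchestration: float = 0.0

    def total(self) -> float:
        return (self.request + self.response + self.retrieval + self.tool_calls
                + self.retries + self.timeout_waste + self.orchestration)


def effective_cost_per_success(costs: RouteCosts, success_rate: float) -> float:
    """Total cost per attempt divided by the fraction of attempts that succeed."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return costs.total() / success_rate


# Sticker prices and success rates from the example above.
provider_a = effective_cost_per_success(RouteCosts(request=0.002), 0.97)   # ~0.00206
provider_b = effective_cost_per_success(RouteCosts(request=0.0016), 0.82)  # ~0.00195

# Assumed for illustration: Provider B also burns $0.0002 per attempt on retries.
provider_b_with_retries = effective_cost_per_success(
    RouteCosts(request=0.0016, retries=0.0002), 0.82
)  # ~0.00220
```

On success rate alone, Provider B is still slightly cheaper. It only takes a small retry overhead to flip the ranking, which is why the retry and timeout terms belong in the numerator rather than in a footnote.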
The ugly part is that API credits make this worse. Credits hide per-operation economics behind prepaid abstractions. You think you’re spending from a pool. You’re actually making dozens of micro-bets on model behavior, throughput, and failure recovery.
But cost is only one-third of the mess.
Average Latency Is a Vanity Metric
We’ve got a hot take here: p50 latency is borderline useless for AI product decisions.
If your chatbot responds in 900ms half the time but takes 8 seconds at p99, users don’t care that your median looks pretty. They think your app is broken. Google has repeatedly published on the impact of latency on user experience and business outcomes, including how speed affects satisfaction and conversion (Think with Google).
For interactive systems, especially voice, tail latency is the whole game. In voice interfaces, even a few hundred milliseconds can make turn-taking feel awkward; that’s why streaming and endpointing matter so much in speech systems (OpenAI Realtime API docs, Deepgram latency discussions). We’ve seen this firsthand in voice AI work: a model that’s “cheap enough” on paper becomes unusable when response timing turns conversations into hostage negotiations.
Short version: users experience p95 and p99, not your nice-looking average.
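A quick way to see the gap is to compute the mean, p50, p95, and p99 from the same samples. The sketch below uses a nearest-rank percentile and made-up latencies; the sample distribution and the `percentile` helper are assumptions for illustration, not data from a real system.

```python
import math
import statistics


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of data at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: 90 fast requests, 8 slow ones, 2 very slow ones.
latencies_ms = [900.0] * 90 + [2500.0] * 8 + [8000.0] * 2

summary = {
    "mean": statistics.mean(latencies_ms),    # 1170.0
    "p50": percentile(latencies_ms, 50),      # 900.0
    "p95": percentile(latencies_ms, 95),      # 2500.0
    "p99": percentile(latencies_ms, 99),      # 8000.0
}
```

With these numbers the mean and the median both look healthy, while one request in fifty takes eight seconds. That is the request your user remembers.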
That’s why compute arbitrage through API routing often falls apart in customer-facing apps. Every extra routing hop, health probe, fallback attempt, and schema repair adds jitter. Maybe your average stays okay. Your worst-case behavior becomes chaos.
Here’s how the routing logic usually evolves once reality shows up:
flowchart TD
A[Incoming AI Request] --> B{Route by price?}
B -->|Yes| C[Cheapest Provider]
C --> D{Timeout or bad output?}
D -->|Yes| E[Fallback Provider]
D -->|No| F[Return Response]
E --> G{Schema mismatch?}
G -->|Yes| H[Repair / Retry Layer]
G -->|No| F
H --> I[Higher latency + higher cost]
I --> F
That diagram looks harmless.
It’s not.
Reliability Is More Than “Provider Uptime”
Most routing systems are built as if reliability means one thing: “If Provider X is down, fail over to Provider Y.”
That’s beginner-level thinking.
Real reliability includes output shape stability, tool-call correctness, moderation behavior, token accounting consistency, streaming behavior, context-window edge cases, and rate-limit handling. OpenAI, Anthropic, Google, and others all publish status pages and API docs, but status pages only tell you if the building is on fire. They don’t tell you whether the kitchen is serving soup in a shoe (OpenAI Status, Anthropic Status, Google Cloud Status).
We tried a “smart” fallback setup once where a secondary provider took over when the primary exceeded a latency threshold. Great idea, right? Disaster. The backup model handled function calling differently, returned subtly different JSON, and our downstream validator started rejecting responses. The system was technically available and functionally broken.
That’s the worst kind of reliable.
Available nonsense.
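One guard against that failure mode is to validate every response against the shape your downstream code expects before accepting it, whichever provider produced it. The sketch below is a minimal version using only the standard library; the `REQUIRED_FIELDS` contract and the field names `tool_name` and `arguments` are hypothetical, not any provider’s actual format.

```python
import json
from typing import Any, Optional

# Hypothetical contract for the tool-call payload our downstream validator expects.
REQUIRED_FIELDS: dict[str, type] = {"tool_name": str, "arguments": dict}


def parse_tool_call(raw: str) -> Optional[dict[str, Any]]:
    """Return the parsed tool call if it matches the contract, else None."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(payload.get(field), expected_type):
            return None
    return payload


def first_valid_response(
    candidates: list[tuple[str, str]],
) -> Optional[tuple[str, dict[str, Any]]]:
    """Walk (provider_name, raw_output) pairs in priority order; return the first valid one."""
    for provider_name, raw in candidates:
        parsed = parse_tool_call(raw)
        if parsed is not None:
            return provider_name, parsed
    return None


# Illustrative outputs: the second one encodes arguments as a string, not an object.
as_object = '{"tool_name": "lookup", "arguments": {"id": 7}}'
as_string = '{"tool_name": "lookup", "arguments": "{\\"id\\": 7}"}'
```

Here `parse_tool_call(as_string)` returns `None`, so a fallback that answers quickly but in the wrong shape is treated as a failure instead of being passed downstream. Tracking how often each provider fails this check gives you the schema failure rate for that route.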
Why Arbitrage Works Better in Finance Than in AI APIs
People borrow the word “arbitrage” from trading because it sounds sharp and profitable. Fair enough. But financial arbitrage depends on relatively clean price discrepancies and fast execution against known instruments.
AI APIs are not known instruments. They’re probabilistic services wrapped in product marketing.
A better analogy is restaurant delivery. You can compare menu prices across apps, sure. But one app has hidden fees, another sends the wrong order, another takes 55 minutes, and another says “delivered” while your food is apparently in a parallel universe. The cheapest option isn’t the best deal if dinner arrives cold and missing half the meal.
That’s why compute arbitrage is a useful framing for API routing only if you include execution quality. In model markets, the spread isn’t just price. It’s price adjusted by latency, quality variance, and failure cost.
Miss that, and your “optimization” is cosplay.
Where API Routing Actually Makes Sense
We’re not anti-routing. We’re anti-naive routing.
API routing makes sense when requests are clearly segmented. Maybe summarization can go to a cheaper model, while high-stakes extraction uses a more reliable one. Maybe background jobs can tolerate slower queues, while live chat can’t. Maybe one provider is your default and another is a pressure-relief valve for burst traffic.
That’s a grown-up architecture.
We’ve seen this work especially well in agent systems where task classes are explicit and measurable. If you’re building AI agents, you can route by task type, tool dependence, or acceptable failure budget rather than just by token price. That usually beats “send everything to whoever’s cheapest this hour.”
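As a rough sketch of routing by task class and failure budget rather than by price alone, the Python below picks the cheapest route that still fits a per-class budget. Every name and number here (`RouteStats`, `TaskBudget`, the route labels, the thresholds) is a hypothetical placeholder; in practice the stats would come from your own measurements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RouteStats:
    name: str                # hypothetical route label
    effective_cost: float    # dollars per successful task
    p95_latency_ms: float
    failure_rate: float      # fraction of tasks that fail after retries


@dataclass(frozen=True)
class TaskBudget:
    max_p95_latency_ms: float
    max_failure_rate: float


# Hypothetical budgets per task class, echoing the examples in the text.
BUDGETS = {
    "summarization": TaskBudget(max_p95_latency_ms=6000, max_failure_rate=0.05),
    "extraction": TaskBudget(max_p95_latency_ms=4000, max_failure_rate=0.01),
    "live_chat": TaskBudget(max_p95_latency_ms=1500, max_failure_rate=0.02),
}


def pick_route(task_class: str, routes: list[RouteStats]) -> Optional[RouteStats]:
    """Cheapest route that fits the task class's budget, or None if nothing fits."""
    budget = BUDGETS[task_class]
    eligible = [
        r for r in routes
        if r.p95_latency_ms <= budget.max_p95_latency_ms
        and r.failure_rate <= budget.max_failure_rate
    ]
    return min(eligible, key=lambda r: r.effective_cost, default=None)


routes = [
    RouteStats("budget", effective_cost=0.0012, p95_latency_ms=5200, failure_rate=0.04),
    RouteStats("balanced", effective_cost=0.0021, p95_latency_ms=2600, failure_rate=0.008),
    RouteStats("premium", effective_cost=0.0040, p95_latency_ms=1100, failure_rate=0.004),
]
```

With these made-up stats, summarization lands on the budget route, extraction on the balanced one, and live chat on the premium one. Price still decides, but only among routes that already meet the budget for that task class.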
And if latency matters a lot, routing isn’t your only lever.
Sometimes the right answer is to stop routing and move inference closer to the user.
The Part Everyone Avoids: Sometimes You Need Fewer APIs, Not More
Here’s our strongest opinion in this whole piece: over-reliance on third-party AI APIs is overrated.
Not because APIs are bad. They’re great for speed. But once your product has stable traffic patterns, strict latency requirements, or sensitive data constraints, there’s a real chance your best move is a custom deployment, a smaller specialized model, or an on-device path.
We’ve seen this especially in voice and edge scenarios. If you’re doing live interactions, shipping every turn through a remote API can be like ordering espresso beans one at a time through international freight. Technically possible. Operationally stupid.
If that sounds familiar, this is where on-device AI, voice AI, or custom models start to matter. Not because they’re trendy, but because they let you trade vendor variability for system control.
Our product RunHotel lives in that world. When the interaction is spoken, real-time, and customer-facing, latency spikes aren’t a rounding error. They’re the product.
But there’s another trap waiting.
Rate Limits Are a Fake Fixed Cost Until They Aren’t
A lot of teams model rate limits like static constraints: Provider A allows X RPM, Provider B allows Y TPM, so just spread traffic accordingly.
That works right up until dynamic throttling, regional congestion, or account-tier quirks show up. Then your “fixed” capacity turns into a weather forecast. Public API documentation from major providers shows that rate limits can vary by usage tier, model, and account state, not just by a single universal cap (OpenAI Rate Limits Guide, Anthropic Documentation).
This is where compute arbitrage gets weird in production. Sometimes the profitable route isn’t the cheapest provider. It’s the one with enough headroom to avoid queue collapse during a traffic spike.
We’ve seen teams squeeze cost down 15% and accidentally create a routing pattern that amplifies rate-limit failures during peak periods. They saved money in calm weather and lost it all in the storm.
Classic false economy.
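One way to encode the headroom idea is to treat remaining rate-limit capacity as an eligibility filter and only then compare price. The sketch below assumes you track current requests per minute yourself; `ProviderCapacity`, the 20% minimum headroom, and the provider figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProviderCapacity:
    name: str
    cost_per_request: float
    rpm_limit: int    # requests per minute allowed for this account (assumed known)
    rpm_in_use: int   # requests sent in the current minute

    @property
    def headroom(self) -> float:
        """Fraction of the per-minute request limit still unused."""
        return 1.0 - self.rpm_in_use / self.rpm_limit


def pick_with_headroom(
    providers: list[ProviderCapacity], min_headroom: float = 0.2
) -> ProviderCapacity:
    """Cheapest provider with enough spare capacity, else the least-loaded one."""
    if not providers:
        raise ValueError("providers must not be empty")
    eligible = [p for p in providers if p.headroom >= min_headroom]
    if not eligible:
        # Nobody has spare capacity: fall back to whoever is least saturated.
        return max(providers, key=lambda p: p.headroom)
    return min(eligible, key=lambda p: p.cost_per_request)


calm = [
    ProviderCapacity("cheap", 0.0016, rpm_limit=1000, rpm_in_use=300),
    ProviderCapacity("pricey", 0.0020, rpm_limit=1000, rpm_in_use=400),
]
spike = [
    ProviderCapacity("cheap", 0.0016, rpm_limit=1000, rpm_in_use=950),
    ProviderCapacity("pricey", 0.0020, rpm_limit=1000, rpm_in_use=400),
]
```

In calm traffic the cheap provider wins. During the spike it has only 5% headroom left, so the router shifts to the pricier provider before the cheap one starts throttling.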
What a Good Routing Strategy Looks Like
A good routing layer doesn’t ask, “Who’s cheapest?”
It asks, “What’s the cheapest path that still hits our latency SLO, output-quality threshold, and failure budget for this task class?”
That means you need:
- Task-based routing, not just provider-based routing
- Real-time health scoring, not binary up/down checks
- Effective cost metrics per successful outcome
- Schema validation and repair paths
- Controlled fallbacks with circuit breakers
- A non-API option when the economics justify it
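For the circuit-breaker item in that list, a minimal sketch might look like the following. It is a simplified variant (consecutive-failure counting with a cooldown) and not a production implementation; the class name, thresholds, and the injectable `clock` parameter are assumptions made for illustration and testability.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after N consecutive failures; allows trial requests again after a cooldown."""

    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # Open: block until the cooldown has elapsed, then let trial traffic through.
        return self._clock() - self._opened_at >= self.cooldown_s

    def record_success(self) -> None:
        # Any success closes the breaker and clears the failure streak.
        self._consecutive_failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold:
            # Open (or re-open) the breaker and restart the cooldown.
            self._opened_at = self._clock()
```

A router would keep one breaker per provider, call `allow_request()` before sending, and report the outcome with `record_success()` or `record_failure()`. Counting a schema validation failure as a failure, not just a timeout, keeps a provider that is up but returning malformed output from staying in rotation.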
If you’re still early, use a calculator before you build a science project. Our AI cost estimator is exactly for this kind of back-of-the-envelope sanity check. And if your system already looks like spaghetti with retries, caches, and vendor-specific hacks, AI consulting is usually cheaper than another quarter of guessing.
Because yes, we’ve tried guessing.
It was bad.
The Real Tradeoff: Margin vs Control
The dream of API routing is simple: more providers, lower cost, better resilience.
The reality is harsher: more providers often means more variance, more glue code, more monitoring, and more places for subtle incompatibilities to hide. You can absolutely win with routing, but only if you treat it like an operations problem, not a pricing trick.
That’s why compute arbitrage keeps showing up in strategy conversations. The arbitrage is real, but it’s conditional. You only capture it if your system is disciplined enough to measure the full cost of execution.
Otherwise you’re not doing arbitrage.
You’re coupon clipping in a hurricane.
FAQ
Does API routing always reduce AI costs?
No, not always. It can reduce sticker price, but effective cost often rises once you include retries, longer prompts, validation failures, and latency penalties.
What metric should we optimize first: cost, latency, or reliability?
Reliability first for customer-facing systems. A cheaper model that fails unpredictably usually costs more in support load, churn, and ugly engineering workarounds.
When does on-device or self-hosted AI beat API routing?
Usually when latency is strict, traffic is predictable, privacy matters, or your workload is narrow enough for a smaller specialized model. That’s especially true for voice and edge use cases.
Is multi-provider routing worth it for startups?
Sometimes, but only if you keep it simple. One primary provider and one tested fallback is usually smarter than building a mini stock exchange for tokens in month two.
How do we know if our routing logic is broken?
Look at effective cost per successful task, p95/p99 latency, fallback frequency, schema failure rate, and user-visible error patterns. If those numbers are ugly, your routing logic is probably lying to you.
What You Should Do Next
Audit one real workflow this week. Not your average metrics. One workflow: prompt size, provider path, retries, latency distribution, output failures, and total cost per successful result.
That exercise will tell you more than twenty pricing pages.
And if you want a second set of eyes on the architecture, talk to us through Cropsly’s contact page. We’ll tell you if your routing strategy is smart, salvageable, or quietly on fire.
Because at scale, AI routing doesn’t fail because the math is hard.
It fails because the math was incomplete.





