Rate Limits Caused 60% of LLM Errors. Retries Amplify the Damage

The Scale of the Problem

In February 2026, nearly 60% of all LLM production errors tracked by Datadog were caused by rate limits — not model failures, not hallucinations, not context window overflows. Rate limits. HTTP 429s. By March that number dropped to 30%, but organizations still logged approximately 8.4 million rate limit failure events in a single month. If you run LLM workloads in production, this is your biggest reliability problem, and you probably don’t have visibility into it.

Datadog’s State of AI Engineering 2026 report, built on telemetry from thousands of production AI systems, paints an uncomfortable picture: as enterprises scale LLM adoption, rate limit failures scale faster. The recurring pattern involves sudden concurrency spikes, shared infrastructure capacity quotas across teams, and retry cascades that amplify single failures into prolonged outages.

Here’s the context that makes this hurt. LLM API providers operate at roughly 99.0–99.5% uptime, while the big-three cloud providers average 99.97%. That is about 2.5 hours of downtime per year for the cloud providers versus up to 3.5 days for LLM APIs. API uptime across the LLM industry fell from 99.66% to 99.46% between Q1 2024 and Q1 2025. Downtime rose from 0.34% to 0.54%, roughly a 60% increase year-over-year, as demand growth outpaced infrastructure scaling.

Why Retries Make Things Worse

The most common resilience mistake in LLM applications isn’t the absence of retries. It’s retries without jitter, without layering discipline, and without a budget.

When a rate limit or timeout fires, a naive retry loop hits the same overloaded endpoint immediately, exhausts its retry allowance within milliseconds, and gives the endpoint no recovery window. In multi-service architectures, the amplification compounds: three attempts at each layer of a five-service call chain produce up to 3^5 = 243 backend calls for each original user request. About 40% of cascading failures in distributed systems trace back to retry logic, according to analysis by Tian Pan. The original problem was minor. The retry behavior made it fatal.
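The multiplication is easy to check. A minimal sketch, assuming every layer makes a fixed number of attempts and every attempt fails through to the next layer (the function name is illustrative, not from any library):

```python
from math import prod


def worst_case_backend_calls(attempts_per_layer):
    """Upper bound on calls reaching the deepest service for one user request.

    attempts_per_layer lists how many attempts each layer makes, outermost first.
    """
    return prod(attempts_per_layer)


# Three attempts at every layer of a five-service chain.
print(worst_case_backend_calls([3, 3, 3, 3, 3]))  # 243

# Three attempts at the outermost layer only; inner layers try once.
print(worst_case_backend_calls([3, 1, 1, 1, 1]))  # 3
```

The second call previews why retrying at a single layer matters: the same three attempts cost 3 backend calls instead of 243.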

The fix has three components:

  • Exponential backoff with full jitter. Pure exponential backoff without jitter synchronizes all clients to retry at the same moment, recreating the thundering herd on every attempt. The working formula: sleep = random_between(0, min(cap, base * 2^attempt)). With a base of 1–1.5 seconds and attempts counted from 0, the first retry waits up to roughly 1 second, the second up to 2–3 seconds, and the third up to 4–6 seconds. Cap the delay at 32–60 seconds and stop after 3–5 attempts. AWS’s architecture blog demonstrates why jittered backoff outperforms un-jittered exponential backoff under contention.
  • Retry at one layer only. If your application calls a service that calls another service, retries at every hop multiply. Pick the outermost application layer as the only place retries happen. Internal layers should propagate failures cleanly.
  • Enforce a retry budget. Total retries should not exceed 10% of total requests at any given time. If the retry rate exceeds the budget, fail fast. This prevents one degraded endpoint from pulling down everything else.
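The first and third components fit in a few dozen lines. What follows is a sketch under stated assumptions, not a reference implementation: the names full_jitter_delay, RetryBudget, and call_with_retries are invented for illustration, the budget is tracked in-process over a sliding 60-second window, and every exception is treated as retryable.

```python
import random
import time
from collections import deque


def full_jitter_delay(attempt, base=1.0, cap=32.0, rng=random):
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)], attempt counted from 0."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


class RetryBudget:
    """Grant retries only while retries <= ratio * requests inside a sliding window."""

    def __init__(self, ratio=0.10, window_s=60.0, clock=time.monotonic):
        self.ratio = ratio
        self.window_s = window_s
        self.clock = clock
        self._requests = deque()  # timestamps of original requests
        self._retries = deque()   # timestamps of granted retries

    def _trim(self, now):
        # Drop entries that have aged out of the window.
        for q in (self._requests, self._retries):
            while q and now - q[0] > self.window_s:
                q.popleft()

    def record_request(self):
        now = self.clock()
        self._trim(now)
        self._requests.append(now)

    def try_acquire_retry(self):
        now = self.clock()
        self._trim(now)
        if len(self._retries) + 1 > self.ratio * len(self._requests):
            return False  # over budget: the caller should fail fast
        self._retries.append(now)
        return True


def call_with_retries(fn, budget, max_attempts=4, sleep=time.sleep):
    """Call fn(); on exception, retry with full jitter while the budget allows."""
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            out_of_attempts = attempt == max_attempts - 1
            if out_of_attempts or not budget.try_acquire_retry():
                raise
            sleep(full_jitter_delay(attempt))
```

A pure ratio starves low-traffic services (one request in the window allows zero retries), so a real deployment would probably add a small fixed floor. A budget shared across processes would need an external store rather than in-memory deques.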

One more thing: never retry 4xx errors blindly. A 400 or 403 will fail every time. The only 4xx worth retrying is 429 (rate limit), and even then, read the Retry-After header before choosing a wait duration. OpenAI’s own guidance recommends exponential backoff and respecting provider-supplied reset times rather than guessing.
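That rule can be written down as a small decision function. This is a hedged sketch: the function names are hypothetical, headers are assumed to arrive as a plain dict keyed exactly as "Retry-After", and treating 5xx responses as retryable is an assumption added here rather than something the guidance above prescribes.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Assumption: 429 plus common transient 5xx codes are worth retrying.
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def parse_retry_after(value, now=None):
    """Seconds to wait from a Retry-After header (delta-seconds or HTTP-date), else None."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable header: let the caller fall back to backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def retry_decision(status, headers, fallback_delay):
    """Return (should_retry, wait_seconds) for an HTTP response."""
    if status not in RETRYABLE_STATUS:
        return (False, 0.0)  # 400, 403 and friends will fail every time
    if status == 429:
        hinted = parse_retry_after(headers.get("Retry-After"))
        if hinted is not None:
            return (True, hinted)  # trust the provider's reset time
    return (True, fallback_delay)
```

The fallback_delay argument is where a jittered backoff value would be passed in. Some providers also expose reset times in their own vendor-specific headers, which this sketch does not read.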

TPM vs RPM: The Dual Axis

LLM rate limits operate on two independent axes simultaneously, and most teams only think about one of them.

RPM (Requests Per Minute) limits the number of API calls. It protects infrastructure from request floods. TPM (Tokens Per Minute) limits compute consumption. It protects GPU capacity from workloads with long prompts or extensive agent chains. You can stay within RPM while blowing past TPM, and vice versa. Both produce a 429, but the underlying cause and the correct response differ.

For agents and RAG pipelines, TPM is almost always the binding constraint. A pipeline that retrieves 20 documents and stuffs them into a 15,000-token prompt burns TPM at roughly 15x the rate of a short-form query, even at the same request count. Token-heavy prompts monopolize GPU inference time and force smaller requests to wait in the queue.
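A quick calculation shows how the binding constraint flips with prompt size. The quota below (500 RPM, 1,000,000 TPM) is hypothetical and chosen only to make the arithmetic visible; real limits vary by provider, model, and account tier.

```python
def binding_limit(rpm_limit, tpm_limit, tokens_per_request):
    """Return (max sustainable requests per minute, which limit binds first)."""
    by_tokens = tpm_limit // tokens_per_request
    if by_tokens < rpm_limit:
        return by_tokens, "TPM"
    return rpm_limit, "RPM"


# Hypothetical quota: 500 RPM and 1,000,000 TPM.
print(binding_limit(500, 1_000_000, 1_000))   # (500, 'RPM')  short-form queries
print(binding_limit(500, 1_000_000, 15_000))  # (66, 'TPM')   15,000-token RAG prompts
```

Under this quota the RAG pipeline tops out at 66 requests per minute, well below the 500 the RPM limit alone would suggest.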

Production-grade token management requires three things:

  1. Pre-flight token estimation (using tiktoken for OpenAI, provider-specific equivalents elsewhere) to reject or queue requests before they blow the budget.
  2. Setting max_completion_tokens to cap output. Without this, a model that decides to write an unusually thorough response can silently exhaust your TPM budget on a single request.
  3. Dual rate limiting at the application layer, not just at the provider edge. Enforce both RPM and TPM in your own code, with a queue that smooths burst traffic using Redis or Kafka rather than shedding it.
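Items 1 and 3 can be sketched together as a pair of token buckets that must both have capacity before a request is admitted. Everything here is an assumption for illustration: the class and function names are invented, the limiter is in-memory and not thread-safe, and estimate_tokens uses a crude four-characters-per-token heuristic where a real tokenizer such as tiktoken would go.

```python
import time


def estimate_tokens(prompt, max_completion_tokens):
    """Rough pre-flight estimate: prompt size plus the worst-case completion.

    Assumption: about 4 characters per token. Swap in a real tokenizer for accuracy.
    """
    return len(prompt) // 4 + 1 + max_completion_tokens


class DualRateLimiter:
    """Admit a request only if both the RPM bucket and the TPM bucket can cover it."""

    def __init__(self, rpm, tpm, clock=time.monotonic):
        self.rpm = rpm
        self.tpm = tpm
        self.clock = clock
        self._requests = float(rpm)  # both buckets start full
        self._tokens = float(tpm)
        self._last = clock()

    def _refill(self):
        now = self.clock()
        elapsed = now - self._last
        self._last = now
        # Each bucket refills continuously at limit / 60 units per second.
        self._requests = min(self.rpm, self._requests + elapsed * self.rpm / 60.0)
        self._tokens = min(self.tpm, self._tokens + elapsed * self.tpm / 60.0)

    def try_acquire(self, estimated_tokens):
        if estimated_tokens > self.tpm:
            raise ValueError("request can never fit within the TPM limit")
        self._refill()
        if self._requests >= 1 and self._tokens >= estimated_tokens:
            self._requests -= 1
            self._tokens -= estimated_tokens
            return True
        return False  # caller should queue the request rather than drop it
```

When try_acquire returns False, the request belongs in a queue and should be retried once the buckets refill. That queue is where Redis or Kafka would come in for a multi-process deployment.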

Azure deployments add a hidden dimension: per-instance limits and shared regional caps are independent. A deployment with five Azure instances each configured for 450K TPM on GPT-4o may still hit a region-wide limit that caps all instances combined at 300K TPM. This is not documented prominently and is typically discovered under load.

Circuit Breakers Save LLM Systems

A circuit breaker sits between your application and the LLM provider. In normal operation (closed state), all requests pass through. When the failure rate exceeds a threshold over a rolling window — say, more than 20% of requests fail over the last 60 seconds — the circuit trips open. In the open state, requests fail immediately without touching the provider, giving it time to recover. After a cooldown period, the circuit enters half-open state and allows a small fraction of test traffic through to probe whether recovery has occurred.
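The three states map onto a small state machine. The sketch below is one possible shape, not a production library: the class name and parameters are illustrative, the 20% threshold and 60-second window come from the example above, and the min_calls floor and 30-second cooldown are assumptions. As a simplification, the half-open state admits a single probe request rather than a fraction of traffic.

```python
import time
from collections import deque


class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, failure_threshold=0.20, window_s=60.0,
                 min_calls=10, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_s = window_s
        self.min_calls = min_calls      # avoid tripping on a tiny sample
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.state = self.CLOSED
        self._events = deque()          # (timestamp, succeeded) pairs
        self._opened_at = 0.0
        self._probe_in_flight = False

    def allow_request(self):
        now = self.clock()
        if self.state == self.OPEN:
            if now - self._opened_at < self.cooldown_s:
                return False            # fail fast without touching the provider
            self.state = self.HALF_OPEN
            self._probe_in_flight = False
        if self.state == self.HALF_OPEN:
            if self._probe_in_flight:
                return False            # only one probe at a time
            self._probe_in_flight = True
        return True

    def record(self, succeeded):
        now = self.clock()
        if self.state == self.OPEN:
            return                      # ignore late results from before the trip
        if self.state == self.HALF_OPEN:
            self._probe_in_flight = False
            if succeeded:
                self.state = self.CLOSED
                self._events.clear()
            else:
                self._trip(now)
            return
        self._events.append((now, succeeded))
        while self._events and now - self._events[0][0] > self.window_s:
            self._events.popleft()
        failures = sum(1 for _, ok in self._events if not ok)
        total = len(self._events)
        if total >= self.min_calls and failures / total > self.failure_threshold:
            self._trip(now)

    def _trip(self, now):
        self.state = self.OPEN
        self._opened_at = now
        self._events.clear()
```

Callers check allow_request() before each provider call and report the outcome with record(). When allow_request() returns False, that is the cue to serve a fallback immediately.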

The concrete production impact is significant. For an application making 100 requests per minute during a five-minute outage:

  • Without circuit breaker: 500–1,000 requests hang for 30 seconds each waiting for timeouts. Users experience degraded responses throughout the outage.
  • With circuit breaker: after roughly 10–15 failed requests trip the threshold, the remaining ~485 requests fail fast in under 10ms. Fallback logic engages immediately. Users see a 200ms response from the secondary provider rather than a 30-second timeout.

For LLM applications, standard HTTP circuit breaker triggers (error rate, consecutive failures, latency P95) are necessary but not sufficient. You need additional triggers specific to AI workloads:

  • Cost per request exceeding a threshold. In mid-2025, a team building a multi-agent financial assistant saw API spend climb from $127/week to $47,000/week because an agent loop ran recursively for eleven days with no circuit breaker to catch it.
  • Conversation turn count. Break circuits at 20+ turns in an agentic conversation. Legitimate reasoning chains rarely need more; runaway loops almost always do.
  • Output quality score. Requires a lightweight LLM-as-judge running on outputs before they reach the user.
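The first two triggers need nothing more than per-run counters. A minimal sketch follows, with invented names (AgentRunGuard, RunawayAgentError) and placeholder prices. The 20-turn limit mirrors the figure above, while the $5.00 per-run cost ceiling is an arbitrary assumption.

```python
from dataclasses import dataclass


class RunawayAgentError(RuntimeError):
    """Raised when an agent run exceeds its turn or cost limit."""


@dataclass
class AgentRunGuard:
    max_turns: int = 20          # trips once the run reaches this many turns
    max_cost_usd: float = 5.00   # hypothetical per-run spending ceiling
    turns: int = 0
    cost_usd: float = 0.0

    def record_turn(self, prompt_tokens, completion_tokens,
                    usd_per_1k_prompt, usd_per_1k_completion):
        """Account for one model call; raise if the run should be stopped."""
        self.turns += 1
        self.cost_usd += (prompt_tokens / 1000.0) * usd_per_1k_prompt
        self.cost_usd += (completion_tokens / 1000.0) * usd_per_1k_completion
        if self.turns >= self.max_turns:
            raise RunawayAgentError(f"turn limit reached: {self.turns}")
        if self.cost_usd > self.max_cost_usd:
            raise RunawayAgentError(f"cost limit exceeded: ${self.cost_usd:.2f}")
```

One guard instance per conversation is enough to stop a recursive loop after minutes rather than days. The quality-score trigger is omitted here because it depends on whichever judge model is in use.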

Multi-Provider Failover Is Not Optional

By mid-2025, 40% of production LLM teams had multi-provider routing in place, up from 23% just ten months earlier. The main forcing function was a series of notable provider outages — including multi-hour incidents at both major foundation model providers — that left single-provider applications completely dark while multi-provider applications failed over in seconds.

There are two failover architectures worth knowing:

  • Sequential failover: primary → secondary → tertiary. Simple to implement. The cost is ~1–3 seconds of additional latency per hop, acceptable for non-interactive workloads.
  • Racing (fan-out): fire requests to primary and secondary simultaneously; use whichever responds first; cancel the other. Eliminates the latency penalty but roughly doubles token cost. Reserve this for interactive use cases where first-token latency is the primary SLO.
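Both architectures are short to express with asyncio. In this sketch each provider is assumed to be an async callable that takes a prompt and returns text. That interface, the function names, and the 10-second timeout are all illustrative rather than any real SDK's API.

```python
import asyncio


async def sequential_failover(providers, prompt, timeout_s=10.0):
    """Try providers in order; return the first successful response."""
    last_error = None
    for call in providers:
        try:
            return await asyncio.wait_for(call(prompt), timeout=timeout_s)
        except Exception as exc:  # includes timeouts; cancellation passes through
            last_error = exc
    raise RuntimeError("all providers failed") from last_error


async def race(providers, prompt, timeout_s=10.0):
    """Fire all providers at once; return the first success and cancel the rest."""
    pending = {asyncio.ensure_future(call(prompt)) for call in providers}
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_s
    last_error = None
    try:
        while pending:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            done, pending = await asyncio.wait(
                pending, timeout=remaining, return_when=asyncio.FIRST_COMPLETED
            )
            if not done:
                break  # overall deadline hit with nothing finished
            for task in done:
                if task.exception() is None:
                    return task.result()
                last_error = task.exception()
        raise RuntimeError("no provider returned a result") from last_error
    finally:
        for task in pending:
            task.cancel()  # stop paying for responses that will not be used
```

Cancelling the losing task stops the client from waiting, but whether the provider still bills for tokens already generated depends on the provider. That is part of why racing roughly doubles cost.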

The engineering challenges are underappreciated. Every provider has different error formats, rate-limit headers, and response schemas. A library like LiteLLM normalizes these, but at ~2,000 RPS its memory usage climbs past 8 GB. Higher-throughput environments need purpose-built gateways (Portkey, Bifrost in Go, or Bedrock for AWS-native stacks).

Fallback models may produce structurally different outputs. Falling back from one model to another during an outage can break downstream JSON parsers if the models format responses differently. Cost can spike dramatically during failover — if your primary is the cheapest option, a 10-hour outage during peak traffic on a more expensive secondary generates significant unexpected spend.

Building a Minimum Resilience Stack

An LLM call in a production system should pass through a request queue that enforces dual TPM/RPM limits, then through a circuit breaker with error-rate and cost-threshold triggers, then to a gateway that handles exponential backoff with full jitter and can route to a secondary provider on 429s, 5xxs, or latency threshold breaches. Outputs should be schema-validated before returning to the application, with a percentage sample sent to a quality monitor.

This is not exotic infrastructure. It is the same distributed systems engineering that makes HTTP microservices reliable — applied to a new category of external dependency that happens to be slower, more expensive per call, and more likely to silently degrade than most services engineers have worked with.

The specific monitoring signals that matter for LLM workloads:

  • Schema validation on every response. If your application expects structured JSON, validate the schema on every response. Schema failures are a leading indicator of model regression.
  • Semantic quality sampling. Run a small percentage of responses through a lightweight quality assessment. A drop in quality scores before a drop in HTTP success rates is an early warning signal.
  • Embedding drift tracking. Track the semantic distribution of responses over time. Sudden drift in output embeddings — even when outputs are syntactically valid — indicates something changed upstream.
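All three signals can start as very small functions. The sketch below uses only the standard library and makes several assumptions: the expected schema (an "answer" string and a numeric "confidence") is invented for illustration, the 2% sample rate is arbitrary, and drift is measured as the cosine distance between the centroids of two batches of embeddings produced by whatever embedding model is already in use.

```python
import json
import math
import random

# Hypothetical response schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"answer": str, "confidence": (int, float)}


def validate_response(raw):
    """Return (parsed_dict_or_None, list_of_errors) for a raw model response."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value is not an object"]
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            errors.append(f"missing field: {name}")
        elif not isinstance(data[name], expected):
            errors.append(f"wrong type for field: {name}")
    return (data if not errors else None), errors


def should_sample(rate=0.02, rng=random):
    """Decide whether this response goes to the quality monitor."""
    return rng.random() < rate


def centroid_drift(baseline, current):
    """Cosine distance between the mean vectors of two batches of embeddings.

    Assumes both batches are non-empty and neither centroid is the zero vector.
    """
    a = [sum(col) / len(baseline) for col in zip(*baseline)]
    b = [sum(col) / len(current) for col in zip(*current)]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm
```

Counting validation errors per minute and charting centroid_drift against a fixed baseline batch gives two time series that can alert independently of HTTP status codes.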

In August 2025, an LLM provider published a postmortem documenting three simultaneous bugs that had been degrading response quality for weeks. None of them were hard errors. HTTP success rates looked normal throughout. They were detected via user complaints and manual investigation. A single quality sampling pipeline would have caught all three within hours instead of weeks.

The teams that built this resilience stack are the ones whose applications kept serving users during every major provider outage in 2025 and 2026. The teams that didn’t are the ones writing incident postmortems about why their application was down for ten hours when the provider was down for ten hours. The provider outage is not optional. The circuit breaker is.

References