72% of Your LLM Calls Are Re-Processing Identical Prompts

69% of Your LLM Input Tokens Are System Prompts — And You’re Paying to Re-Process Them Every Single Call

Datadog’s 2026 State of AI Engineering report, built on LLM telemetry from over a thousand production environments, contains a number that should make every platform engineer wince: 69% of all input tokens in customer traces are system prompts — internal instructions, policy definitions, and tool schemas executing down the chain from the initial user query. Yet only 28% of LLM call spans show any cached-read input tokens, even among models that support prompt caching. That means the vast majority of production LLM calls are re-processing identical scaffolding at full price, every single time (Datadog State of AI Engineering 2026).

The report isn’t another survey of aspirational roadmaps. It’s telemetry data from real production workloads. And the picture it paints is one of an industry that has moved fast on model adoption but hasn’t caught up on operational discipline. This article breaks down the seven key findings and what they mean for engineers running AI in production.

Finding 1: The Multi-Model Fleet Is Now the Default

OpenAI’s share of production LLM usage among Datadog customers dropped from 75% to 63% year-over-year. But here’s the nuance the headline misses: the absolute number of customers using OpenAI more than doubled. OpenAI isn’t shrinking — the market is expanding faster than any single provider.

The real story is multi-model portfolios. Over 70% of organizations now use three or more models, and the share using more than six models nearly doubled. Teams aren’t picking winners; they’re treating inference like a pipeline, routing different task types to different models based on latency, cost, and capability requirements (Datadog State of AI Engineering 2026).

This aligns with what Berkeley’s RouteLLM project demonstrated: routing simpler queries to cheaper models can achieve up to 85% cost reduction while maintaining 95% of frontier model quality (RouteLLM, arXiv:2406.18665). The infrastructure implication is significant — you now need a model gateway (OpenRouter, LiteLLM, Portkey, or a custom router) to manage routing, fallbacks, and capacity across providers rather than hardcoding a single API endpoint.

Finding 2: Model Churn Is a Governance Problem, Not a Migration Problem

Teams are quick to adopt new model releases — Claude Sonnet 4.6 hit 17% adoption in its first month. But old models die slowly. As of March 2026, Sonnet 4.5 and GPT-4o still held 19% and 22% adoption respectively, comparable to the newer Sonnet 4.6 and GPT-5.4. Each overlapping model in your fleet introduces its own quality, latency, and cost profile. The same prompts, tools, and agent workflows produce different results across models.

This is LLM tech debt compounding in real time. Every model you add without retiring an old one doubles your evaluation surface area. The report notes that organizations are “adding new models faster than they are simplifying their fleets” (Datadog State of AI Engineering 2026).

The engineering response: continuous evaluation frameworks that can benchmark all active models against your actual production workloads, and model gateways that make swapping or deprecating a model a configuration change rather than a code deployment. Anthropic’s own engineering guidance recommends starting with the most capable model, then systematically replacing sub-tasks with smaller models (Anthropic: Building Effective Agents).

Finding 3: Agent Framework Adoption Doubled — And So Did Operational Complexity

Agent framework adoption (LangChain, LangGraph, Pydantic AI, Vercel AI SDK, CrewAI, and others) nearly doubled year-over-year, from 9% of organizations to almost 18% by early 2026. The number of services using agentic frameworks more than doubled in the same period (Datadog State of AI Engineering 2026).

Frameworks accelerate development but introduce invisible complexity. Tool fan-out, retries, and branching are one import away. The report warns that “agent sprawl can set in as the framework adds more steps and paths under the hood and it becomes harder for engineers to understand what’s happening in the runtime”. Cost and latency drift upward. Failures become harder to reproduce.

Vercel’s CEO Guillermo Rauch puts it directly: “The next wave of agent failures won’t be about what agents can’t do. It’ll be about what teams can’t observe” (Datadog State of AI Engineering 2026). Without per-step tracing, you can’t tell whether your agent is slow because of the LLM, a tool call, a retry loop, or context bloat.

Finding 4: Prompt Caching Is the Biggest Unused Lever in Production AI

This is the finding that should trigger immediate action. Here’s the math:

69% of input tokens are system prompts (instructions, policies, tool schemas)
Only 28% of LLM calls on caching-capable models show any cached-read tokens
That means 72% of calls are re-processing stable, repetitive scaffolding at full token cost

Anthropic’s prompt caching offers up to 90% cost reduction and 85% latency reduction for cache hits (Anthropic Prompt Caching). OpenAI’s automatic prompt caching provides a minimum 50% token cost reduction for cached prefixes (OpenAI Prompt Caching). These aren’t experimental features — they’re production-ready and available on major models.

The culprit, per the report, is prompt layout. If dynamic content is injected too early in the prompt, or stable blocks get reordered between requests, the prefix reuse that enables caching breaks. The fix is structural: place static system instructions, tool schemas, and policy definitions at the beginning of the prompt. Mark cache breakpoints explicitly. Modularize reusable components. This is a prompt engineering discipline, not a code change.

Finding 5: Context Windows Exploded — But Quality Didn’t

Leading model context windows have gone from 128,000 tokens to as high as two million tokens in two years. The average tokens per request more than doubled for median customers and quadrupled for the 90th percentile. Teams are stuffing more conversation history, retrieved documents, tool outputs, and guardrails into every call.

But bigger context doesn’t mean better results. The report notes that “noise and redundancy can drown out the signal — especially when critical details get buried deep in long inputs”. Context quality, not volume, is the new limiting factor (Datadog State of AI Engineering 2026).

The implication: invest in context engineering — retrieval quality, summarization, deduplication, and clear information hierarchy. Langfuse’s production tracing data shows that 5+ step agents have significantly higher failure rates than 2-3 step agents, partly because context degrades with each step (Langfuse Tracing Documentation). Shoving more tokens at the problem doesn’t fix a retrieval pipeline that surfaces irrelevant documents.

Finding 6: Rate Limit Errors Are the Dominant Production Failure Mode

In February 2026, 5% of all LLM call spans reported an error, and 60% of those errors were rate limit (HTTP 429) failures. In March 2026, 2% of spans returned errors, with rate limits accounting for nearly a third — still amounting to 8.4 million rate limit errors in a single month across the dataset (Datadog State of AI Engineering 2026).

The dominant production failure mode of LLM applications is capacity, not code. Organization-wide quota sharing, concurrency spikes from agentic loops, and retry cascades turn periodic bursts into sustained failures. OpenAI’s tier-based rate limiting ranges from 500 RPM (Tier 1) to 10,000+ RPM (Tier 5) (OpenAI Rate Limits), but agentic workflows with variable loop lengths can exhaust these unpredictably.

The engineering playbook is clear:

Implement token budgets to force agent loops to terminate before exhausting capacity
Use exponential backoff with jitter — not naive retries that amplify load
Deploy a model gateway with fallback chains (LiteLLM, Portkey) that automatically routes to alternative providers on 429 (LiteLLM Fallbacks)
Monitor x-ratelimit-remaining headers to predict throttling before it impacts users

Finding 7: 59% of Production Agents Are Still Monoliths

Despite the industry narrative around multi-agent systems, 59% of agentic application requests make only a single service call. Only 18% make three or more service calls. Production agents are still largely monoliths — single services handling planning, tool execution, and response generation in one process (Datadog State of AI Engineering 2026).

Monoliths don’t scale well for agents. As workflows grow, you need trace propagation across service boundaries, service maps that include tool calls, and independent scaling for different agent components. The shift toward dedicated agent services and multi-agent architectures is happening, but it’s early. The teams pulling ahead are decomposing agents into microservices with clear interfaces — planning service, execution service, evaluation service — each independently observable and scalable.

Arize AI’s production observability data identifies the most common agent failure modes in monolithic architectures: infinite loops (agent repeats actions), tool call hallucination (calling non-existent tools), and context window exhaustion (Arize AI Platform). These are exactly the kinds of failures that are invisible without per-step tracing and that decomposed architectures make easier to isolate and fix.

What to Do This Sprint

If you’re running LLM workloads in production, the Datadog report gives you a clear prioritization framework:

Audit your prompt caching hit rate. If it’s below 50% on caching-capable models, restructure your prompt layout. Static content goes first. This is the highest-ROI change you can make — potentially halving your token costs with zero model behavior change.
Add per-step tracing to your agents. If you can’t see which step in an agent chain failed, you’re flying blind. OpenTelemetry GenAI semantic conventions are standardizing this (OpenTelemetry GenAI).
Implement retry budgets and fallback routing. If rate limits are your top error source, naive retries make the problem worse. Add token budgets, backpressure, and multi-provider fallback.
Start a model deprecation cadence. If you’re running 6+ models, you have evaluation debt. Pick a cadence (quarterly) to benchmark fleet performance and retire underperformers.

FAQ

What percentage of LLM input tokens are system prompts in production?

According to Datadog’s 2026 State of AI Engineering report, 69% of all input tokens in customer traces are system prompts — internal instructions, policy definitions, and tool guidance. Only 28% of LLM calls show any cached-read input tokens, meaning most organizations are paying full price to re-process identical scaffolding on every call (source).

Why are rate limit errors the most common LLM failure in production?

Datadog’s telemetry shows that rate limit (HTTP 429) errors account for 60% of all LLM call failures in production. The root cause is capacity ceilings: agentic workflows with variable loop lengths, organization-wide quota sharing, and retry cascades create unpredictable bursts that exhaust provider rate limits. The fix is capacity engineering — token budgets, exponential backoff, and multi-provider fallback routing (source).

How many models are organizations using in production?

Over 70% of organizations use three or more models, and the share using more than six nearly doubled year-over-year. OpenAI’s share dropped from 75% to 63% even as absolute usage doubled, because teams are building model portfolios rather than standardizing on a single provider. Multi-model routing through gateways like OpenRouter, LiteLLM, or Portkey is becoming standard infrastructure (source).

Are most production agents monolithic or distributed?

59% of agentic application requests make only a single service call, meaning most production agents are still monoliths. Only 18% make three or more service calls. The industry is early in the shift toward multi-agent architectures, where planning, execution, and evaluation run as independent, observable services (source).

Cloud AI