70% of Teams Run 3+ LLMs in Production. Nobody Knows How to Retire Them.

OpenAI’s Market Share Dropped From 75% to 63% in One Year — And That’s the Least Interesting Part

Datadog’s 2026 State of AI Engineering report, released in April 2026, analyzed LLM telemetry across more than a thousand production environments. The headline finding: 70% of organizations now run three or more models simultaneously, and the share running six or more nearly doubled year-over-year. OpenAI still leads at 63% provider share, but Google Gemini and Anthropic Claude each gained over 20 percentage points in adoption. The real story isn’t the provider horse race — it’s that teams are accumulating model fleets faster than they can govern them, and the operational consequences are compounding [Datadog State of AI Engineering 2026].

The Multi-Model Portfolio Is the New Normal

Model diversification has moved from theory to production reality. According to the Datadog dataset, the number of Datadog customers using OpenAI more than doubled even as OpenAI’s share dropped from 75% to 63%. This isn’t a zero-sum game — it’s expansion across the board. Teams are building model portfolios, routing lightweight extraction tasks to cheap models and reserving frontier models for synthesis and complex reasoning [Datadog].

The engineering implication is significant. Each additional model in your fleet introduces its own latency profile, cost curve, failure mode, and evaluation burden. The same prompt, the same tool definitions, the same agent workflow — all produce different results across models. The report explicitly warns that teams are “adding new models faster than they are simplifying their fleets,” turning model churn into a governance problem [Datadog].

Organizations that treat inference as a pipeline — with a model gateway handling routing, fallback, and provider-agnostic evaluation — are the ones pulling ahead. OpenRouter, which provides a unified API across hundreds of models, reported that its customers want to “switch quickly, test freely, and discover the best model for their workflows” [OpenRouter Documentation]. That’s the operational pattern that scales.
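To make the gateway idea concrete, here is a minimal sketch of tier-based routing with ordered fallback. Everything in it is an assumption for illustration: the `ModelGateway` class, the tier names, and the stub provider functions are hypothetical and do not represent any vendor's API or the report's recommended design.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical provider signature: takes a prompt, returns generated text.
ProviderFn = Callable[[str], str]


class ProviderError(Exception):
    """Raised by a provider stub when a call fails (for example, throttling)."""


@dataclass
class Route:
    """Ordered list of model names to try for one task tier."""
    models: List[str]


class ModelGateway:
    def __init__(self, providers: Dict[str, ProviderFn], routes: Dict[str, Route]) -> None:
        self.providers = providers
        self.routes = routes

    def complete(self, tier: str, prompt: str) -> Tuple[str, str]:
        """Try each model configured for the tier, in order.

        Returns (model_name, output) from the first model that succeeds.
        """
        errors = []
        for name in self.routes[tier].models:
            try:
                return name, self.providers[name](prompt)
            except ProviderError as exc:
                errors.append(f"{name}: {exc}")
        raise ProviderError("all models failed: " + "; ".join(errors))


# Stub providers standing in for real SDK calls (assumptions, not real models).
def cheap_model(prompt: str) -> str:
    return f"cheap:{prompt}"


def frontier_model(prompt: str) -> str:
    return f"frontier:{prompt}"


def throttled_model(prompt: str) -> str:
    raise ProviderError("rate limited")


gateway = ModelGateway(
    providers={"cheap": cheap_model, "frontier": frontier_model, "throttled": throttled_model},
    routes={
        "extraction": Route(["throttled", "cheap"]),   # falls back to the cheap model
        "synthesis": Route(["frontier", "cheap"]),     # frontier model first
    },
)

print(gateway.complete("extraction", "parse invoice"))
print(gateway.complete("synthesis", "draft summary"))
```

Because callers ask for a tier rather than a model, swapping or retiring a model becomes a change to the route table instead of a change to every agent.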

Model Tech Debt Compounds Faster Than You Think

The Datadog report surfaces a pattern that will feel familiar to anyone who’s managed a microservices migration: teams adopt new models quickly but retire old ones slowly. Claude Sonnet 4.6 reached 17% adoption in its first month. Meanwhile, Sonnet 4.5 and GPT-4o still sat at 19% and 22% adoption respectively as of March 2026 — comparable to their newer replacements [Datadog].

GPT-4o remains the single most common model in the Datadog trace dataset, even though OpenAI has already retired it from the ChatGPT UI and its API future is uncertain. Teams running production agents on GPT-4o are carrying a deprecation risk they may not have quantified. When providers sunset models, as they inevitably do, every agent relying on that model needs re-evaluation, prompt tuning, and potentially architectural changes. Multiply that across six or more models in a fleet, and you have a maintenance burden that nobody budgeted for.

The fix isn’t “use fewer models.” It’s to treat your model fleet like a dependency tree. Track which agents depend on which models, maintain evaluation benchmarks for each, and schedule model migrations with the same discipline you’d apply to a major library upgrade. If you don’t have a model registry with deprecation alerts, build one before your next provider announcement catches you off guard.
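A registry like this does not need to be elaborate to be useful. The sketch below is one possible in-memory shape, assuming hypothetical model names, agent names, dates, and thresholds (`warn_days`, `max_age_days`); a real version would persist records and pull sunset dates from provider announcements.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Dict, List, Optional


@dataclass
class ModelRecord:
    name: str
    agents: List[str] = field(default_factory=list)  # agents that call this model
    last_evaluated: Optional[date] = None
    sunset_date: Optional[date] = None               # provider-announced retirement, if any


class ModelRegistry:
    def __init__(self) -> None:
        self._records: Dict[str, ModelRecord] = {}

    def register(self, record: ModelRecord) -> None:
        self._records[record.name] = record

    def dependents(self, model: str) -> List[str]:
        """Agents that need re-evaluation if this model changes or is retired."""
        return list(self._records[model].agents)

    def deprecation_alerts(self, today: date, warn_days: int = 90) -> List[str]:
        """Models whose sunset date has passed or falls within warn_days of today."""
        horizon = today + timedelta(days=warn_days)
        return sorted(
            r.name for r in self._records.values()
            if r.sunset_date is not None and r.sunset_date <= horizon
        )

    def stale_evaluations(self, today: date, max_age_days: int = 30) -> List[str]:
        """Models never evaluated, or last evaluated more than max_age_days ago."""
        cutoff = today - timedelta(days=max_age_days)
        return sorted(
            r.name for r in self._records.values()
            if r.last_evaluated is None or r.last_evaluated < cutoff
        )


# Placeholder entries; the names and dates are invented for illustration only.
registry = ModelRegistry()
registry.register(ModelRecord("model-a", ["support-agent", "triage-agent"],
                              last_evaluated=date(2026, 4, 1), sunset_date=date(2026, 6, 1)))
registry.register(ModelRecord("model-b", ["summarizer"],
                              last_evaluated=date(2026, 1, 10)))

today = date(2026, 4, 15)
print(registry.deprecation_alerts(today))
print(registry.stale_evaluations(today))
print(registry.dependents("model-a"))
```

Running the two checks on a schedule turns a provider announcement into a list of affected agents rather than a scramble.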

Agent Framework Adoption Doubled — And So Did Operational Complexity

Framework adoption among Datadog customers nearly doubled year-over-year, rising from 9% of organizations in early 2025 to almost 18% by early 2026. The count of services using agentic frameworks more than doubled in the same period. LangChain, LangGraph, Pydantic AI, Vercel AI SDK, OpenAI Agents, CrewAI, and dozens of others are now production dependencies [Datadog].

The report highlights a critical tension: frameworks accelerate building but introduce “costly operational complexity.” Tool fan-out, retries, and branching are one import away. Agent sprawl sets in as the framework adds execution steps under the hood, and cost and latency drift upward without anyone making a conscious decision to increase them. Failures become harder to reproduce because the control flow is driven by the LLM itself [Datadog].

Vercel CEO Guillermo Rauch puts it directly: “The next wave of agent failures won’t be about what agents can’t do. It’ll be about what teams can’t observe.” The solution isn’t avoiding frameworks — it’s instrumenting them aggressively. Comprehensive agent telemetry, distributed traces across service boundaries, and service maps that include tool calls are the minimum viable observability stack for production agents [Datadog].

Your System Prompts Are Eating Your Token Budget

69% of all input tokens in the Datadog customer traces were system prompts: internal instructions, policy definitions, and tool guidance that are re-sent on every call downstream of the initial user query. More than two-thirds of input tokens go to repeating the same scaffolding across every call [Datadog].

Prompt caching should be the obvious fix. Yet only 28% of LLM call spans show any cached-read input tokens — even among models that support caching. The majority of calls re-process the full prompt every single time [Datadog].

The usual culprit is prompt layout. If dynamic content is injected too early in the prompt, or stable blocks get reordered between requests, the prefix reuse that enables caching breaks. The fix is architectural: place static content (system instructions, tool schemas, policies) at the beginning of the prompt, followed by dynamic context. Modularize reusable components. Measure cache-hit rates per model and per agent. This isn’t rocket science, but it requires deliberate engineering that most teams haven’t prioritized [Datadog].
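The layout rule can be illustrated with a small sketch. It assumes a hypothetical system prompt and tool schema, uses shared leading characters as a rough proxy for the reusable prefix (real providers cache on tokens and have their own minimum lengths and rules), and assumes usage records with a `cached_input_tokens` field, which is an invented name rather than any provider's actual schema.

```python
import json
from os.path import commonprefix
from typing import Dict, List

# Hypothetical static content; real system prompts are far longer.
SYSTEM_INSTRUCTIONS = "You are a support agent. Follow the refund policy."
TOOL_SCHEMAS = [{"name": "lookup_order", "parameters": {"order_id": "string"}}]

# Serialized once, with sorted keys, so the text is identical on every request.
STATIC_PREFIX = "\n".join([SYSTEM_INSTRUCTIONS, json.dumps(TOOL_SCHEMAS, sort_keys=True)])


def build_prompt(user_context: str, user_query: str) -> str:
    """Cache-friendly layout: static blocks first, dynamic content last."""
    return "\n".join([STATIC_PREFIX, user_context, user_query])


def build_prompt_dynamic_first(user_context: str, user_query: str) -> str:
    """Anti-pattern for comparison: dynamic context precedes the static blocks."""
    return "\n".join([user_context, STATIC_PREFIX, user_query])


def shared_prefix_chars(a: str, b: str) -> int:
    """Length of the common leading substring of two prompts."""
    return len(commonprefix([a, b]))


def cached_span_share(spans: List[Dict[str, int]]) -> float:
    """Fraction of call spans that read any input tokens from cache."""
    if not spans:
        return 0.0
    return sum(1 for s in spans if s["cached_input_tokens"] > 0) / len(spans)


good = shared_prefix_chars(
    build_prompt("alice: order 17", "Where is my order?"),
    build_prompt("bob: order 42", "Can I get a refund?"),
)
bad = shared_prefix_chars(
    build_prompt_dynamic_first("alice: order 17", "Where is my order?"),
    build_prompt_dynamic_first("bob: order 42", "Can I get a refund?"),
)
print("static-first shared prefix:", good, "dynamic-first shared prefix:", bad)

# Invented usage records for one agent, for illustration only.
spans = [
    {"input_tokens": 1200, "cached_input_tokens": 1000},
    {"input_tokens": 1200, "cached_input_tokens": 0},
    {"input_tokens": 1150, "cached_input_tokens": 0},
    {"input_tokens": 1300, "cached_input_tokens": 1000},
]
print("share of spans with cached reads:", cached_span_share(spans))
```

With static content first, the two requests share the entire scaffold; with dynamic content first, they share nothing, so there is no prefix to reuse. Tracking `cached_span_share` per model and per agent gives a number comparable to the report's span-level figure.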

Rate Limits Are the Dominant Failure Mode — And They Compound

In February 2026, 5% of all LLM call spans in the Datadog dataset reported an error, and 60% of those errors were rate limit violations. By March, the overall error rate dropped to 2%, but rate limits still accounted for nearly a third of all failures — 8.4 million rate limit errors in a single month across the observed customer base [Datadog].
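It helps to restate those figures as a share of all calls rather than a share of errors. The short calculation below uses only the percentages quoted above, with one labeled assumption: "nearly a third" is treated as exactly one third, so the March result is an approximation.

```python
# Figures quoted from the report
feb_error_rate = 0.05         # share of LLM call spans that reported an error
feb_rate_limit_share = 0.60   # share of those errors that were rate limits
mar_error_rate = 0.02

# Assumption: "nearly a third" is taken as exactly 1/3 for a rough estimate.
mar_rate_limit_share = 1 / 3

# Share of ALL call spans that failed on a rate limit
feb_rate_limited = feb_error_rate * feb_rate_limit_share
mar_rate_limited = mar_error_rate * mar_rate_limit_share

print(f"February: {feb_rate_limited:.1%} of all spans hit a rate limit")
print(f"March:    {mar_rate_limited:.2%} of all spans hit a rate limit")
```

Under that assumption, rate-limited calls fell from about 3% of all spans in February to somewhat under 1% in March, which is still a large absolute number at this volume.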

The problem compounds in agent systems. Long-lived agent loops, whether ReAct-style or collaborative multi-agent patterns, hit provider rate limits and trigger retries that increase load further, which can turn a transient throttling event into a sustained system failure. Organization-wide capacity quotas shared across teams make this worse: a burst from one agent can exhaust the allocated capacity for every other agent in the org [Datadog].

The operational playbook is straightforward but underimplemented: implement token budgets and call-count limits to prevent runaway loops. Build queue systems with exponential backoff at the gateway layer. Maintain fallback capacity with alternative models or providers. And design prompts and agent logic to avoid unnecessary spikes in loop length and tool fan-out. If your dominant production failure mode is capacity, your dominant engineering priority should be capacity planning [Augment Code: AI SRE Guide 2026].
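The first two items in that playbook can be sketched together: a per-run budget that caps calls and tokens, and a retry wrapper that backs off exponentially with jitter. The class and function names, the limits, and the `RateLimitError` stand-in are all assumptions for illustration; a real gateway would map its provider SDK's throttling error onto something like it.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Stand-in for a provider's throttling error (typically HTTP 429)."""


class BudgetExceeded(Exception):
    """Raised when an agent run would exceed its call or token budget."""


class LoopBudget:
    """Caps calls and tokens for one agent run to stop runaway loops."""

    def __init__(self, max_calls: int, max_tokens: int) -> None:
        self.max_calls = max_calls
        self.max_tokens = max_tokens
        self.calls = 0
        self.tokens = 0

    def charge(self, tokens: int) -> None:
        if self.calls + 1 > self.max_calls or self.tokens + tokens > self.max_tokens:
            raise BudgetExceeded(f"calls={self.calls}, tokens={self.tokens}")
        self.calls += 1
        self.tokens += tokens


def call_with_backoff(
    fn: Callable[[], T],
    budget: LoopBudget,
    est_tokens: int,
    max_retries: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimitError with exponential backoff and full jitter."""
    for attempt in range(max_retries + 1):
        budget.charge(est_tokens)  # retries count against the budget too
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise
            # Full jitter: wait a random time up to the capped exponential delay.
            sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
    raise RuntimeError("unreachable")


# Demo: a call that is throttled twice, then succeeds. Sleeps are recorded, not slept.
attempts = {"n": 0}


def flaky_call() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError("429")
    return "ok"


delays = []
budget = LoopBudget(max_calls=5, max_tokens=1_000)
result = call_with_backoff(flaky_call, budget, est_tokens=100, sleep=delays.append)
print(result, budget.calls, len(delays))
```

Charging the budget on every attempt, including retries, is the design choice that matters here: it keeps a throttled loop from retrying its way into the sustained failure described above.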

59% of Production Agents Are Still Monoliths

Despite the buzz around multi-agent architectures, 59% of agentic application requests in the Datadog dataset made only a single service call. Just 18% made three or more service calls. Production agents are still largely monolithic [Datadog].

This isn’t necessarily bad — a single-purpose agent that does one thing well is often more reliable than a distributed system of agents that coordinate poorly. But it does mean that the infrastructure for multi-agent systems (cross-service trace propagation, context sharing, service maps that include tool boundaries) is still immature in most organizations. The teams building dedicated agent services with microservice-style interfaces are the exceptions, not the rule.

Platform engineering teams should expect this to shift rapidly. As agent complexity grows and the limitations of monolithic patterns become painful, the demand for distributed agent infrastructure will spike. Getting ahead of it means investing now in the primitives: trace propagation across agent boundaries, shared context stores, and service mesh patterns adapted for non-deterministic workloads [Absolute Ops: Platform Engineering 2026].
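Trace propagation is the most mechanical of those primitives. The sketch below hand-rolls a `traceparent` header in the W3C Trace Context format (version, 32-hex trace ID, 16-hex parent span ID, flags) to show what has to cross an agent boundary. The helper names are hypothetical, and in practice a tracing library would normally generate and inject these headers for you.

```python
import re
import secrets
from typing import Dict

# W3C Trace Context header: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace at the entry point (for example, the first agent)."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def child_traceparent(parent: str) -> str:
    """Keep the trace ID, mint a new span ID for the outgoing hop."""
    match = _TRACEPARENT.match(parent)
    if match is None:
        raise ValueError(f"malformed traceparent: {parent!r}")
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def outgoing_headers(incoming: Dict[str, str]) -> Dict[str, str]:
    """Headers an agent attaches when calling a downstream agent or tool service."""
    parent = incoming.get("traceparent")
    header = child_traceparent(parent) if parent else new_traceparent()
    return {"traceparent": header}


# Hypothetical three-hop chain: planner agent -> retrieval agent -> tool service.
planner = outgoing_headers({})
retrieval = outgoing_headers(planner)
tool = outgoing_headers(retrieval)

for hop in (planner, retrieval, tool):
    print(hop["traceparent"])
```

Every hop shares one trace ID while getting its own span ID, which is what lets a service map stitch a multi-agent request back together.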

What Actually Matters for Engineering Teams

The Datadog report paints a picture of an industry that moved fast on adoption and is now facing the operational bill. Here’s what to do about it:

  • Build a model registry. Track every model in production, which agents depend on it, and when it was last evaluated. Set deprecation alerts. Treat model sunsets as scheduled incidents.
  • Audit your prompt layout. If 69% of your tokens are system prompts and your cache-hit rate is below 50%, you’re burning money on re-processing identical content. Restructure prompts for prefix reuse.
  • Instrument before you scale. If you’re using agent frameworks, comprehensive telemetry isn’t optional — it’s the only way to debug non-deterministic control flow in production.
  • Capacity-plan for rate limits. Budgets, backpressure, and fallback routing should be in your agent runtime, not bolted on after the first production incident.

The teams that treat AI production engineering with the same rigor they apply to distributed systems — capacity planning, observability, dependency management, graceful degradation — will be the ones whose agents actually work at scale. Everyone else will keep accumulating model tech debt and wondering why their bill keeps climbing [Datadog].

References