o3 Drops 34 Points Across Turns
OpenAI’s o3 model scores 98.1 on single-turn benchmarks. Distribute the same information across multi-turn exchanges, the way actual agents work, and that score collapses to 64.1. That’s a 34-point absolute drop, and it’s not an outlier. Across all tested models, multi-turn context sharding produces an average 39% performance decline, according to research compiled in Digital Applied’s 2026 agent reliability playbook. The implication is brutal: your agent isn’t failing because the model is bad. It’s failing because the context window is mismanaged.
This is the core problem that context engineering addresses. Anthropic’s Applied AI Team defined it in September 2025 as “the set of strategies for curating and maintaining the optimal set of tokens during LLM inference, including all the other information that may land there outside of the prompts.” Andrej Karpathy simplified it: “the delicate art and science of filling the context window with just the right information for the next step.” Cognition AI, the company behind Devin, called it “effectively the #1 job of engineers building AI agents.”
If you’re still optimizing prompts while ignoring what fills the other 90% of the context window, you’re fixing the paint job on a car with a broken engine.
Why Prompt Engineering Isn’t Enough
The distinction is architectural. Prompt engineering optimizes a single instruction string. Context engineering manages the full token lifecycle — system prompt, retrieved documents, tool definitions, conversation history, memory, and everything else competing for attention at inference time.
Production agents don’t run one call. They run loops. An agent at step 47 carries residue from steps 1 through 46: tool outputs, intermediate reasoning, retrieved chunks, error traces. Each addition pushes the model further from its training distribution. Sourcegraph’s practical guide frames it as four questions: what do we fetch, when do we fetch it, how do we compress it, and when do we throw it away. Get any of these wrong, and the model doesn’t fail gracefully — it hallucinates confidently. This is part of the broader engineering-for-reliability shift in agentic infrastructure that’s replacing the move-fast-and-break-things approach.
The token economics compound fast. By turn 30, input tokens can be 5-10x what they were at turn 1. Tool results alone can consume thousands of tokens per call. Most of that context is noise the model has to attend over before producing anything useful.
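A rough sketch of that growth, under assumed numbers: a 2,000-token starting context and 500 tokens of new material per turn, with nothing ever removed. Both figures and both function names are illustrative and are not taken from any cited benchmark.

```python
def input_tokens_at_turn(turn: int, base_tokens: int = 2_000, added_per_turn: int = 500) -> int:
    """Input size at a 1-indexed turn when every earlier turn stays in context."""
    if turn < 1:
        raise ValueError("turn is 1-indexed")
    return base_tokens + (turn - 1) * added_per_turn


def cumulative_input_tokens(turns: int, base_tokens: int = 2_000, added_per_turn: int = 500) -> int:
    """Total input tokens processed over a run, since each call re-reads the history."""
    return sum(input_tokens_at_turn(t, base_tokens, added_per_turn) for t in range(1, turns + 1))


first = input_tokens_at_turn(1)
thirtieth = input_tokens_at_turn(30)
print(thirtieth / first)            # growth factor at turn 30 under these assumptions
print(cumulative_input_tokens(30))  # total input tokens across all 30 calls
```

With these assumed inputs, turn 30 carries 16,500 tokens, which is 8.25x the first turn, and the run processes 277,500 input tokens in total. The per-turn size grows linearly, but the total grows roughly quadratically because every call pays for the full history again.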
Four Agent-Specific Failure Modes
Digital Applied’s research, drawing from Chroma’s “Context Rot” benchmark and multiple vendor reports, identifies four distinct failure patterns that emerge specifically in agentic workflows:
Context Poisoning
A hallucination enters the context window and compounds across turns. The Gemini 2.5 Pokémon agent’s “goals, summary” section became “poisoned with misinformation about the game state”, causing it to pursue impossible objectives in an escalating loop. One bad output becomes the input for the next decision, and the error compounds with every turn.
Context Distraction
The model over-relies on accumulated history rather than performing novel reasoning. Onset typically occurs around 100k tokens. The agent starts repeating prior actions with low output novelty — it’s pattern-matching against its own log rather than solving the current problem.
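One cheap way to watch for this pattern is to measure how many recent actions are repeats of earlier ones. The sketch below assumes the agent logs each action as a string. The function name `repeat_ratio` and the window size are hypothetical, and any alert threshold would need tuning per agent.

```python
def repeat_ratio(actions: list[str], window: int = 5) -> float:
    """Fraction of the last `window` actions that already appeared earlier in the log."""
    if window < 1:
        raise ValueError("window must be at least 1")
    if not actions:
        return 0.0
    recent = actions[-window:]
    seen = set(actions[:-window])  # everything before the recent window
    repeats = 0
    for action in recent:
        if action in seen:
            repeats += 1
        seen.add(action)  # a duplicate inside the window also counts on its second occurrence
    return repeats / len(recent)


looping = ["ls", "cat a.py", "ls", "cat a.py", "ls", "cat a.py", "ls"]
exploring = ["ls", "cat a.py", "grep TODO", "pytest"]
print(repeat_ratio(looping, window=4))    # every recent action is a repeat
print(repeat_ratio(exploring, window=4))  # every recent action is new
```

A ratio near 1.0 over a long stretch suggests the agent is replaying its own log. Exact string matching is a deliberate simplification; near-duplicate actions with different arguments would need a looser comparison.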
Context Confusion
Too many tools or documents overwhelm the model’s ability to select correctly. The Berkeley Function-Calling Leaderboard showed Llama 3.1 8B failing with 46 available tools but succeeding with 19. If your agent exposes 30+ tools, you likely have a context confusion problem regardless of model size.
Context Clash
Conflicting information across turns produces contradictory outputs. A Microsoft/Salesforce study documented this in multi-turn sharding scenarios. The agent hedges, contradicts itself, or picks the wrong version of facts. Frontier-tier models are not immune.
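A minimal sketch of clash detection, assuming the facts an agent relies on have already been extracted into (turn, key, value) records. That extraction step is the hard part and is not shown; `find_clashes` is a hypothetical helper and is not part of the cited study.

```python
from collections import defaultdict


def find_clashes(facts: list[tuple[int, str, str]]) -> dict[str, list[tuple[int, str]]]:
    """Return each key asserted with more than one distinct value, with the turns involved."""
    by_key: dict[str, list[tuple[int, str]]] = defaultdict(list)
    for turn, key, value in facts:
        by_key[key].append((turn, value))
    return {
        key: sorted(entries)
        for key, entries in by_key.items()
        if len({value for _, value in entries}) > 1
    }


facts = [
    (1, "deploy_target", "staging"),
    (4, "branch", "main"),
    (9, "deploy_target", "production"),
]
print(find_clashes(facts))
```

Surfacing the conflict is only the first step. Whether the later value wins, or the agent should ask for clarification, is a policy decision the sketch leaves open.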
Verbatim Compaction Beats Summarization
The instinct when context grows is to summarize. It’s wrong. Summarization-based compression scores 3.4-3.7 out of 5 on accuracy in production evaluations because it paraphrases away exactly the details that matter: file paths, error codes, specific numeric values, and precise decisions. JetBrains found that summarization causes 13-15% longer agent trajectories compared to verbatim compaction, because agents re-derive information that was paraphrased away.
Verbatim compaction — deleting low-information tokens while preserving surviving text character-for-character — produces measurably better outcomes. The model doesn’t need to re-discover what it already found. It just needs less noise around the signal. This aligns with findings that schema-valid output still gets 20% of values wrong — the format is correct but the substance degrades as context rots.
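A coarse sketch of the idea, assuming a simple message log with `role` and `content` fields. It drops stale tool results whole and never rewrites anything that survives. The `Message` type, `compact_verbatim`, and the error markers are illustrative assumptions; a real chat API may also require tool calls and their results to stay paired, which this sketch does not handle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    role: str      # "system", "user", "assistant", or "tool"
    content: str


def compact_verbatim(
    messages: list[Message],
    keep_recent_tool_results: int = 2,
    error_markers: tuple[str, ...] = ("Error", "Traceback"),
) -> list[Message]:
    """Drop old tool results, but return every surviving message unchanged."""
    tool_indices = [i for i, m in enumerate(messages) if m.role == "tool"]
    recent = set(tool_indices[-keep_recent_tool_results:]) if keep_recent_tool_results > 0 else set()
    kept = []
    for i, message in enumerate(messages):
        is_error = any(marker in message.content for marker in error_markers)
        if message.role != "tool" or i in recent or is_error:
            kept.append(message)  # same object, so the text is character-for-character identical
    return kept


log = [
    Message("system", "You are an agent."),
    Message("tool", "listing: a.py b.py"),
    Message("assistant", "I will open a.py"),
    Message("tool", "Traceback (most recent call last): KeyError"),
    Message("tool", "contents of a.py"),
    Message("tool", "tests passed"),
]
for message in compact_verbatim(log, keep_recent_tool_results=1):
    print(message.role, "|", message.content)
```

Deleting whole messages is the bluntest form of verbatim compaction. Finer-grained versions trim within a message, but the rule stays the same: text is either kept exactly or removed, never paraphrased.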
Anthropic’s approach is instructive. Claude Code’s auto-compact triggers at 95% context window usage, clearing stale tool calls and results while preserving the decision trail. Their memory tool writes to file-based storage outside the context window, letting the agent retrieve information on demand rather than carrying it through every turn.
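The file-based pattern can be sketched in a few lines. This is not Anthropic’s memory tool or its API; `FileMemory` is a hypothetical stand-in that shows the shape of the idea: state lives on disk, and only the keys (or a single retrieved value) need to enter the context window.

```python
import json
import tempfile
from pathlib import Path


class FileMemory:
    """Hypothetical key/value store that keeps agent state outside the context window."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        if not key.isidentifier():  # keep keys to simple names so they are safe as file names
            raise ValueError(f"invalid memory key: {key!r}")
        return self.root / f"{key}.json"

    def write(self, key: str, value) -> None:
        self._path(key).write_text(json.dumps(value), encoding="utf-8")

    def read(self, key: str, default=None):
        path = self._path(key)
        return json.loads(path.read_text(encoding="utf-8")) if path.exists() else default

    def keys(self) -> list[str]:
        """Lightweight identifiers the agent can list without loading any values."""
        return sorted(path.stem for path in self.root.glob("*.json"))


with tempfile.TemporaryDirectory() as tmp:
    memory = FileMemory(Path(tmp) / "memory")
    memory.write("decision_log", {"turn": 12, "choice": "retry with smaller batch"})
    print(memory.keys())
    print(memory.read("decision_log"))
```

The design choice worth noting is the split between `keys()` and `read()`: the agent can carry a short list of identifiers through every turn and pay for the full value only when it needs it.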
Token Budgets Need Hard Limits
Databricks research found that correctness for retrieval tasks begins falling at ~32k tokens for Llama 3.1 405B, and earlier for smaller models. Raw context length is a poor proxy for usable context capacity. Million-token windows push back the point where degradation sets in; they don’t eliminate it.
Chroma’s Context Rot research, which extended the Needle in a Haystack benchmark with semantic matching and LongMemEval, confirmed this across Claude Sonnet 4, GPT-4.1, Qwen3-32B, and Gemini 2.5 Flash. The degradation is measurable, consistent, and architectural — rooted in quadratic attention complexity, training data bias toward shorter sequences, and position encoding interpolation.
Practical token budgets for production agents should follow LangChain’s write-select-compress-isolate framework. Write important state to external storage immediately. Select just-in-time using lightweight identifiers. Compress when you approach 80% window usage, not 95%. Isolate conflicting or domain-specific context into separate subagent windows.
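The compress step reduces to a threshold check that runs before every model call. A minimal sketch, assuming the caller already knows the current token count and the model’s window size; `should_compact` and the 200,000-token window in the example are illustrative.

```python
def should_compact(used_tokens: int, window_tokens: int, compact_at_percent: int = 80) -> bool:
    """True once usage reaches the threshold. Integer math avoids float edge cases at the boundary."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    return used_tokens * 100 >= compact_at_percent * window_tokens


print(should_compact(150_000, 200_000))  # 75% of the window: keep going
print(should_compact(160_000, 200_000))  # 80% of the window: compact now
```

Checking before the call rather than after a failure is the point: at 80% there is still room to run compaction itself, which may need a model call of its own.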
The numbers are concrete: RAG applied to tool descriptions — not just documents — improves tool selection accuracy by approximately 3x. Progressive disclosure, where tools are loaded in tiers (name only → full definition → execution scripts), keeps dormant tool definitions from consuming attention budget.
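A toy version of retrieval over tool descriptions combined with tiered disclosure. Keyword overlap stands in for the embedding search a production system would more likely use, and the tool names, descriptions, and helper functions are all made up for illustration.

```python
import re


def _terms(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_tools(query: str, tool_descriptions: dict[str, str], top_k: int = 3) -> list[str]:
    """Rank tools by word overlap with the query and return the best-matching names."""
    query_terms = _terms(query)
    scored = [
        (len(query_terms & _terms(f"{name} {description}")), name)
        for name, description in tool_descriptions.items()
    ]
    ranked = sorted(scored, key=lambda item: (-item[0], item[1]))
    return [name for score, name in ranked[:top_k] if score > 0]


def disclose(selected: list[str], tool_descriptions: dict[str, str]) -> dict[str, str]:
    """Second tier: load full definitions only for the tools that were selected."""
    return {name: tool_descriptions[name] for name in selected}


tools = {
    "read_file": "Read the contents of a file from disk",
    "send_email": "Send an email message to a recipient",
    "run_tests": "Run the unit tests for the project",
}
chosen = select_tools("run the project tests", tools, top_k=1)
print(chosen)
print(disclose(chosen, tools))
```

Only the selected definitions reach the model, so a catalog of dozens of tools costs a handful of definitions per call. Swapping the overlap score for embedding similarity changes the ranking quality, not the structure.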
What Production Teams Should Change
Most teams shipping agents today have a context engineering problem they’re misdiagnosing as a model capability problem. Here’s what to do differently:
- Audit your context window at turn 20. Print every token the model sees. If more than 30% is raw tool output that hasn’t been compacted, you’re burning attention budget on noise.
- Cap available tools at 15-20 per agent. Use routing to expose subsets. If two tools could plausibly handle the same request, the model will regularly pick the wrong one.
- Implement verbatim compaction, not summarization. Delete raw tool outputs after extracting structured results. Keep error traces — Manus found that agents repeat mistakes when error context is removed.
- Write to external memory early and often. Every important decision, intermediate result, or structured finding should hit durable storage before it gets compacted out of the window.
- Set hard context thresholds. Trigger compaction at 75-80% window usage, not at the edge. The last 20% of context is where degradation accelerates.
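The audit in the first item can start as a simple breakdown by message role. The sketch below assumes messages are available as (role, content) pairs and counts whitespace-separated words as a stand-in for real tokens; a production audit would use the model’s own tokenizer. `audit_context` is a hypothetical helper.

```python
def audit_context(messages: list[tuple[str, str]], max_tool_share: float = 0.30) -> dict:
    """Report approximate size per role and flag when tool output exceeds the budget."""
    counts: dict[str, int] = {}
    for role, content in messages:
        counts[role] = counts.get(role, 0) + len(content.split())  # word count as a token proxy
    total = sum(counts.values())
    tool_share = counts.get("tool", 0) / total if total else 0.0
    return {"counts": counts, "tool_share": tool_share, "over_budget": tool_share > max_tool_share}


snapshot = [
    ("system", "You are an agent."),
    ("user", "Fix the failing test."),
    ("tool", "FAILED tests/test_api.py::test_login expected 200 got 500 " * 3),
]
report = audit_context(snapshot)
print(report["counts"])
print(report["over_budget"])
```

Running this at a fixed turn on real traces gives a baseline to compare against after compaction is added.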
The teams getting agent reliability right aren’t using better models. They’re engineering better context. The model is a constant — the context window is the variable you control. And if you’re still re-processing identical prompts on every call, your context pipeline has bigger problems than phrasing.
References
- Context Engineering: Agent Reliability Playbook 2026 — Digital Applied
- Effective Context Engineering for AI Agents — Anthropic Engineering
- Context Engineering: A Practical Guide for AI Agents — Sourcegraph
- Context Engineering for Agents — LangChain Blog
- LLM Inference Optimization: Cut Cost and Latency at Every Layer — Morph
- State of Context Engineering in 2026 — Towards AI