Context Engineering: Why AI Agents Fail at Step 47

o3 Drops 34 Points Across Turns

OpenAI’s o3 model scores 98.1 on single-turn benchmarks. Distribute the same information across multi-turn exchanges, the way actual agents work, and that score collapses to 64.1. That’s a 34-point absolute drop, and it’s not an outlier. Across all tested models, multi-turn context …