Long-Context Models Drop 40% Accuracy Past 200K Tokens

DeepSeek V4-Pro scores 78% on single-needle retrieval at 1M tokens. On multi-needle retrieval, the test that most closely resembles production workloads, it collapses to 41%. GPT-5.5 falls from 96% to 74%. Claude Opus 4.7 falls from 89% to 56%. Only Gemini 3 Deep Think holds its position. …
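
A drop like these can be read two ways: in percentage points, or relative to the starting score. The sketch below works through both for the three pairs of figures quoted above. The names `SCORES`, `point_drop`, and `relative_drop` are illustrative, and it assumes every pair compares single-needle to multi-needle retrieval, which the text states explicitly only for DeepSeek V4-Pro.

```python
# Scores quoted in the paragraph above, in percent: (before, after).
# Assumption: every pair compares single-needle to multi-needle retrieval;
# the text states this explicitly only for DeepSeek V4-Pro.
SCORES = {
    "DeepSeek V4-Pro": (78, 41),
    "GPT-5.5": (96, 74),
    "Claude Opus 4.7": (89, 56),
}


def point_drop(before, after):
    """Absolute drop in percentage points."""
    return before - after


def relative_drop(before, after):
    """Drop as a percentage of the starting score."""
    return 100 * (before - after) / before


if __name__ == "__main__":
    for model, (before, after) in SCORES.items():
        print(
            f"{model}: {point_drop(before, after)} points, "
            f"{relative_drop(before, after):.1f}% relative"
        )
```

By this arithmetic DeepSeek V4-Pro loses 37 points, which is close to half of its starting score, while GPT-5.5 loses 22 points, or a little under a quarter.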