Reasoning Models Cost 15x. Adaptive Depth Saves 60%

Send one complex query to OpenAI o3 and it can burn 10,000 to 50,000 reasoning tokens before emitting a single visible word — all billed at the $60-per-million output rate, all hidden in a thinking block that never appears in the response (source). Reasoning models are the single biggest line-item shift in inference economics since transformers, and most teams pay for them without ever measuring what the extra thinking actually bought.

Reasoning Tokens Are the Hidden Bill

Production agents in 2026 run on a three-tier token economy, and only one tier is silent. Input tokens show up in your prompt. Output tokens show up in your response. Reasoning tokens, the chain-of-thought the model generates inside a thinking block before answering, are billed at full output rates but never surface in the user-facing payload (source). On frontier pricing that is $15–60 per million tokens consumed invisibly. A uniformly configured coding agent generates 2,000–10,000 thinking tokens per step regardless of difficulty; at Claude Opus output pricing and the top of that range, that is roughly $9 in pure thinking across a 15-step loop, before a single useful token ships (source).

The asymmetry is the trap. Reasoning tokens are 4–8x the price of input tokens and are spent on every step, easy or hard. If only four of fifteen steps in an agent loop actually need deep reasoning, the other eleven are financing answers a classifier could have produced for a tenth of the cost.

Three Scaling Laws, One Goes Up

Until 2024 there was one scaling law: more parameters and more pretraining compute bought better answers. The reasoning-model wave added a second axis that platforms now expose explicitly in their APIs. OpenAI o3 ships a reasoning.effort parameter (low / medium / high) tunable per request. Anthropic’s extended thinking takes a budget_tokens value, with a 1,024-token floor. Gemini exposes thinkingConfig with a budget (source). Performance improves logarithmically with the thinking budget — meaning the first 1,000 tokens of reasoning buy most of the gain, and the next 100,000 buy a sliver (source).
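The three knobs named above can be wrapped in one helper so the rest of an agent only deals with a single effort label. This is a minimal sketch, not any vendor's SDK: it only builds the request fragment as a plain dictionary. The effort-to-budget mapping is hypothetical, and the nested field names (`type`, `thinkingBudget`) are assumptions to check against each provider's current API reference.

```python
from typing import Any

# Floor the text gives for Anthropic's extended thinking budget.
ANTHROPIC_MIN_BUDGET = 1024

# Hypothetical mapping from an effort label to a token budget; tune per workload.
EFFORT_TO_BUDGET = {"low": 1024, "medium": 4096, "high": 16384}


def thinking_params(provider: str, effort: str) -> dict[str, Any]:
    """Return the request fragment that sets reasoning depth for one provider."""
    if effort not in EFFORT_TO_BUDGET:
        raise ValueError(f"unknown effort: {effort!r}")
    budget = EFFORT_TO_BUDGET[effort]
    if provider == "openai":
        # reasoning.effort takes a label rather than a token count.
        return {"reasoning": {"effort": effort}}
    if provider == "anthropic":
        # Assumed shape; the budget is clamped to the documented floor.
        clamped = max(budget, ANTHROPIC_MIN_BUDGET)
        return {"thinking": {"type": "enabled", "budget_tokens": clamped}}
    if provider == "gemini":
        # Assumed field name for the budget inside thinkingConfig.
        return {"thinkingConfig": {"thinkingBudget": budget}}
    raise ValueError(f"unknown provider: {provider!r}")
```

Keeping the mapping in one place means a later change to budgets touches one table instead of every call site.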

The infrastructure consequence is concrete. Inference is projected to account for two-thirds of all AI compute in 2026, up from half in 2025, driven almost entirely by reasoning workloads (source). A peer-reviewed HPCA-32 2026 analysis of agentic test-time scaling found that while accuracy keeps rising with compute, the returns diminish rapidly, latency variance widens, and infrastructure costs become unsustainable without a deliberate compute-efficiency discipline (source). The vendor pitch — “just let it think longer” — is true and ruinous at the same time.

The Math of Always-On Thinking

Run the arithmetic on a real agent loop. A software-engineering agent reading an error log, locating a failing test, navigating to the source, diagnosing the root cause, writing a fix, and verifying it performs six kinds of step. Only diagnosis and the fix genuinely need deep reasoning; the rest are pattern-match or command-execution work. A reasoning model configured at a fixed effort generates 2,000–10,000 thinking tokens per step anyway (source). Across a 15-step loop at frontier output pricing, that is as much as $9 in thinking tokens alone, of which perhaps $5 was spent on steps where reasoning added nothing.
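The loop arithmetic can be reproduced in a few lines. This sketch assumes a uniform thinking budget per step and the $60-per-million output rate quoted earlier; the function name and the runs-per-day figure are illustrative only.

```python
def thinking_cost(steps: int, tokens_per_step: int, usd_per_million: float) -> float:
    """USD spent on reasoning tokens across a loop with a uniform per-step budget."""
    return steps * tokens_per_step * usd_per_million / 1_000_000


low = thinking_cost(15, 2_000, 60.0)     # about 1.80 USD per run
high = thinking_cost(15, 10_000, 60.0)   # about 9.00 USD per run

# Hypothetical volume: 1,000 runs a day at the high end of the range.
daily_high = high * 1_000                # about 9,000 USD per day
```

The $9 figure is the upper bound of the range: 15 steps at 10,000 thinking tokens each, billed at $60 per million.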

Scale that to thousands of daily runs and over-thinking becomes a P&L line item, not a quality decision. OpenAI’s o3-pro, tuned for maximum reasoning depth, runs roughly $280 a month in typical agent workloads — 3.6x more than o3 and 18x more than o4-mini — almost entirely because of reasoning-token volume (source). The capability gap between o3-pro and o4-mini is real on competition math. It is irrelevant on most production agent steps.

Token tier	Frontier cost	Efficient-model cost	Visible to user?
Input	$3–15 / M	$0.10–0.25 / M	Yes, in prompt
Output	$15–60 / M	$0.40–1.50 / M	Yes, in response
Reasoning	$15–60 / M	$0.40–1.50 / M	No — hidden in thinking block

That invisibility is why the bill surprises teams. This is the same category of invisible-cost failure we dissected in our piece on agentic workflows costing 5x the budget: the spend compounds inside abstractions the dashboard never breaks out.
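One way to surface the silent tier is to split each request's cost into three lines in your own telemetry. The sketch below uses a hypothetical `Usage` record whose field names are illustrative, not a provider schema, and it assumes the reported output count already includes reasoning tokens, which should be verified per provider.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Hypothetical usage record; field names are illustrative."""
    input_tokens: int
    output_tokens: int      # assumed to include reasoning tokens
    reasoning_tokens: int


def cost_lines(u: Usage, input_rate: float, output_rate: float) -> dict[str, float]:
    """Split one request's cost (USD) by token tier. Rates are USD per million tokens."""
    visible = u.output_tokens - u.reasoning_tokens
    return {
        "input": u.input_tokens * input_rate / 1e6,
        "visible_output": visible * output_rate / 1e6,
        "reasoning": u.reasoning_tokens * output_rate / 1e6,
    }


# Example: 2,000 input tokens, 500 visible output tokens, 10,000 reasoning tokens.
lines = cost_lines(Usage(2_000, 10_500, 10_000), input_rate=15.0, output_rate=60.0)
```

In this example the reasoning line is about $0.60, while input and visible output come to about $0.03 each.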

Latency Is the Second Tax

The cost number is only half the damage. Latency for a standard model sits at 1–3 seconds; a reasoning model on the same prompt runs 10–90 seconds, with variance that widens as the thinking budget grows (source). That rules reasoning models out of live chat and synchronous request paths entirely. They belong in background processing, second-stage triage, and batch analysis. Putting o3 in front of a user-facing endpoint does not just cost more; it breaks the interaction model.

The latency tax compounds in agent loops because each step waits on the previous step’s full reasoning budget. A 15-step loop where every step thinks for 60 seconds is a 15-minute request, not a 15-second one. This is exactly the multi-step degradation pattern we documented in multi-agent reliability dropping from 85% per step to 20% at step 10: when every step both thinks long and depends on the last, reliability and latency collapse together. Adaptive reasoning is not just a cost optimization here — it is a reliability tool, because the steps that think fastest are the steps that fail least.
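Both compounding effects are simple to compute. The sketch below assumes strictly sequential steps and, for the reliability figure, independent per-step success. The 2-second latency for fast steps is a hypothetical value inside the 1–3 second band quoted earlier, and the four deep steps are likewise an illustrative split.

```python
def loop_latency_s(step_latencies_s: list[float]) -> float:
    """Sequential loop: total latency is the sum of the per-step latencies."""
    return sum(step_latencies_s)


def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


uniform = loop_latency_s([60.0] * 15)                # 900 s, i.e. 15 minutes
adaptive = loop_latency_s([60.0] * 4 + [2.0] * 11)   # 262 s with 4 deep steps
ten_steps = chain_success(0.85, 10)                  # about 0.197
```

Under the independence assumption, 85% per-step reliability compounds to roughly 20% by step 10, which matches the figure cited above.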

Adaptive Depth: The 60% Cut

The fix is not a smaller model. It is routing reasoning effort per step. CogRouter (February 2026) and ARES (March 2026) train agents to dynamically adapt reasoning depth and hit state-of-the-art task performance while using 50–62% fewer tokens (source). The “Reasoning on a Budget” survey and UC Berkeley’s compute-optimal scaling work show adaptive allocation delivers roughly 4x better efficiency than any fixed-budget approach (source). A production agent that learns when to think hard and when to think fast cuts its inference bill by 50–80% while holding or improving task success (source).

The reason this works is the logarithmic scaling curve. The first 1,000 thinking tokens capture most of the accuracy gain on a hard step; the next 49,000 buy margin. On an easy step, even 1,000 is waste. An adaptive router that spends the full budget only on the genuinely hard steps — root-cause diagnosis, multi-constraint code generation, mathematical verification — captures the gain on the steps that matter and pays nothing on the ones that do not. The token reduction is not a quality tradeoff; it is removing spend the model was never converting into correctness.
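A toy model makes the shape of that curve concrete. It assumes accuracy gain is proportional to the natural log of the thinking budget, measured from a 1-token baseline. Both the functional form and the baseline are illustrative assumptions, not fitted to any benchmark.

```python
import math


def gain_share(budget: int, max_budget: int) -> float:
    """Fraction of the gain available at max_budget that is captured by budget,
    under the assumed model gain(b) proportional to ln(b)."""
    if max_budget <= 1 or not 1 <= budget <= max_budget:
        raise ValueError("need 1 <= budget <= max_budget and max_budget > 1")
    return math.log(budget) / math.log(max_budget)


def marginal_gain(budget: int) -> float:
    """Derivative of ln(b): the relative return on one more thinking token."""
    return 1.0 / budget


share = gain_share(1_000, 50_000)                    # about 0.64
ratio = marginal_gain(1_000) / marginal_gain(50_000)  # about 50
```

Under this toy model the first 1,000 tokens capture about 64% of what 50,000 would, and one extra token at the 1,000 mark is worth about 50 times what it is worth at 50,000.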

Build a Per-Request Routing Layer

The provider APIs already hand you the levers — reasoning.effort, budget_tokens, thinkingConfig — so the engineering work is a routing layer that classifies each step and sets the budget accordingly. Best practice in mid-2026 is to activate reasoning only on requests with genuine multi-step logic: complex VAT cases, distributed-systems debugging, contract analysis, mathematical verification (source). On summarisation, classification, and simple writing, reasoning raises cost without raising quality. A routing layer — LiteLLM, OpenRouter, or a custom classifier — decides per request whether thinking is on, at what effort, and for how many tokens.

Classify step difficulty first. Pattern-match steps (log reading, string search, command execution) get a fast model or zero reasoning budget. Only diagnosis and generation steps unlock deep thinking.
Start every reasoning budget at the floor. Claude’s 1,024-token minimum is where the logarithmic curve is steepest. Increase incrementally only when measured accuracy on a held-out set justifies it.
Make reasoning tokens visible in your own telemetry. OpenAI hides them from the response, but the usage object reports them. Break them out as a separate cost line so over-thinking shows up before the invoice does.
Keep reasoning out of synchronous paths. The 10–90 second latency band is incompatible with chat. Route reasoning to background queues and batch pipelines.
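The four rules above can be sketched as a small routing function. Everything here is hypothetical scaffolding: the step labels, the `Route` type, the queue names, and the escalation cap are illustrative, and a real classifier would replace the lookup sets.

```python
from dataclasses import dataclass

# Hypothetical step labels. Anything not marked deep gets no reasoning budget,
# which covers pattern-match work such as log reading and command execution.
DEEP_STEPS = {"diagnose", "generate_fix", "verify_math"}

FLOOR_TOKENS = 1024      # the budget floor cited in the text
CAP_TOKENS = 16_384      # hypothetical ceiling for escalation


@dataclass(frozen=True)
class Route:
    thinking: bool
    budget_tokens: int
    queue: str           # "sync" or "background"


def route(step_kind: str) -> Route:
    """Start deep steps at the floor, off the synchronous path; skip reasoning otherwise."""
    if step_kind in DEEP_STEPS:
        return Route(True, FLOOR_TOKENS, "background")
    return Route(False, 0, "sync")


def escalate(r: Route, accuracy_gain_justified: bool) -> Route:
    """Double the budget only when held-out accuracy justifies it, up to the cap."""
    if not r.thinking or not accuracy_gain_justified:
        return r
    return Route(True, min(r.budget_tokens * 2, CAP_TOKENS), r.queue)
```

For example, `route("read_log")` returns a zero-budget synchronous route, while `escalate(route("diagnose"), True)` raises the diagnosis budget from 1,024 to 2,048 tokens. Sending unknown step kinds to the no-reasoning path is a design choice made here for cost safety; a more conservative router could default them to the floor instead.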

Test-time compute is not going away; it is how frontier capability is bought in 2026. The teams that win on it are not the ones who think longest but the ones who think longest only where it pays.
