Multi-Agent Reliability: 85% Per Step, 20% at Step 10

The Compound Failure Equation

Here is the math that most teams deploying multi-agent AI systems have never computed: if each agent step succeeds 85% of the time, a rate most vendors would call impressive, a 10-step workflow completes successfully just 19.7% of the time. That is 0.85^10 ≈ 0.197, assuming each step fails independently of the others. Scale to a 20-step pipeline and you are at 3.9%. This is not a theoretical exercise, and as our earlier analysis of agent failure at step 47 showed, context decay amplifies the problem well beyond what the math alone predicts. Temporal’s engineering team published this exact calculation to explain why production agent systems fail at rates that surprise their builders. The 2026 International AI Safety Report identifies persistent unreliability as a core challenge for the foundation models underpinning these systems.
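
The arithmetic is easy to reproduce. The following is a minimal sketch, assuming every step succeeds independently with the same probability; the function name workflow_success_rate is illustrative and not taken from any vendor's tooling.

```python
def workflow_success_rate(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    # Independent events multiply, so n steps give per_step ** n.
    return per_step ** steps


if __name__ == "__main__":
    for steps in (1, 5, 10, 20):
        rate = workflow_success_rate(0.85, steps)
        print(f"{steps:>2} steps at 85% per step: {rate:.1%} end to end")
```

Real workflows rarely meet the independence assumption, since an early error tends to make later steps more likely to fail, so treat the result as an optimistic baseline rather than a forecast.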

RAND Corporation’s 2025 analysis of over 2,400 enterprise AI initiatives found that 80% of AI projects fail to deliver their intended business value. Of the $684 billion enterprises poured into AI in 2025, more than $547 billion produced no measurable results. Gartner predicts over 40% of agentic AI projects will be canceled by 2027. The technology is not the bottleneck — the failure modes are structural, and they compound with every orchestration step you add. As we documented when agentic workflows cost 5x more than teams budgeted, the financial impact of these failures is equally compounding.

Six Failure Modes Unique to Agents

Traditional software fails in visible, logged, reproducible ways. A database query returns an error code. An API responds with a 500 status. AI agents fail differently. As Trantor’s engineering analysis documents, an agent can complete a task — returning a confident, well-formatted output — while getting the answer completely wrong. It can misunderstand an instruction at step two and silently propagate that error across twenty downstream steps. Six failure modes are specific to agentic systems and have no meaningful parallel in traditional software:

  • Tool misuse: The agent calls a tool with incorrect arguments, selects the wrong tool, or fails to handle a tool error and continues as if the call succeeded. A data cleanup agent interpreting “remove redundant files” too broadly deletes the production folder because “cleanup” sounded efficient.
  • Context drift: As an agent accumulates outputs across a long task, attention dilutes across an ever-wider context. Research on “lost in the middle” effects in long-context models shows information positioned mid-context is retrieved far less reliably than at the start or end.
  • Goal drift: The agent subtly shifts its objective over time, completing a task that no longer matches what was actually requested.
  • Retry loops: The agent encounters a transient failure, retries the same approach, fails again, and spirals — consuming tokens and time without convergence.
  • Cascading errors in multi-agent systems: A wrong inference at step three propagates forward, generating increasingly confident but increasingly incorrect downstream reasoning across multiple agents.
  • Silent quality degradation: Output quality erodes gradually over the course of a workflow, with no single failure point that triggers an alert. Even schema-valid LLM output can get 20% of values wrong — and that is before you chain multiple steps together.

Each of these can occur when every individual LLM response appears locally coherent and well-formed. That is precisely what makes them dangerous — there is no error code, no stack trace, no log line that says “this agent drifted from its goal at step 14.”
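
Two of the modes above, ignored tool errors and retry loops, can at least be made loud. The sketch below is one possible guard, not a prescribed implementation: it assumes a tool is any zero-argument callable that raises on failure, and the names call_with_retry_budget and ToolCallFailed are hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class ToolCallFailed(Exception):
    """Raised when a tool call exhausts its retry budget."""


def call_with_retry_budget(
    tool: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
) -> T:
    """Call `tool`, retrying a bounded number of times, then escalate."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return tool()
        except Exception as exc:  # a real system would catch narrower types
            last_error = exc
            if attempt < max_attempts:
                # Exponential backoff between attempts: base, 2x base, 4x base...
                time.sleep(base_delay * 2 ** (attempt - 1))
    # Never continue as if the call succeeded: surface the failure instead.
    raise ToolCallFailed(f"gave up after {max_attempts} attempts") from last_error
```

The design choice is that the loop has a hard ceiling and ends in an exception, so the orchestrator has to decide what happens next instead of the agent retrying indefinitely.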

The Infinite Loop Problem

When multiple agents operate in a supervisor-worker topology, a particularly destructive failure mode emerges: the infinite handoff loop. Cogent’s 2026 orchestration failure playbook calls this the “Mirror Mirror” effect. It occurs when agents with slightly conflicting instructions bounce tasks back and forth without resolution.

The mechanism is directive misalignment. Each agent interprets its role narrowly and rejects outputs that do not perfectly match its criteria. Neither has the authority to override or reconcile the conflict. Agent A (enforcing “perfect professional tone”) flags drafts as too informal. Agent B (tasked with “casual and relatable” content) revises them as too stiff. The process repeats endlessly: an infinite tug-of-war that burns compute cycles and token budget on every round trip without ever converging.

A fundamental rule for detecting this: you cannot ask an agent if it is in a loop. You must prove it mathematically. Relying on an agent to self-diagnose a logic trap is like asking a spinning compass to find north — the very mechanism you need for orientation is the one that is broken.
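
One way to detect a loop from outside the agents is to count handoffs in the orchestrator. The sketch below is an assumption about how that could look, not Cogent's method: HandoffMonitor, its thresholds, and the fingerprinting scheme are all hypothetical.

```python
import hashlib
from collections import Counter


class HandoffLoopDetected(Exception):
    """Raised by the orchestrator when handoffs stop making progress."""


class HandoffMonitor:
    def __init__(self, max_handoffs: int = 20, max_repeats: int = 2) -> None:
        self.max_handoffs = max_handoffs  # hard cap on total handoffs per task
        self.max_repeats = max_repeats    # allowed repeats of an identical handoff
        self.total = 0
        self.seen: Counter = Counter()

    def record(self, sender: str, receiver: str, payload: str) -> None:
        """Record one handoff; raise if it looks like a loop."""
        self.total += 1
        # Normalize case and whitespace so trivial edits hash the same.
        normalized = " ".join(payload.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        key = (sender, receiver, digest)
        self.seen[key] += 1
        if self.seen[key] > self.max_repeats:
            raise HandoffLoopDetected(f"identical handoff {sender} -> {receiver} repeated")
        if self.total > self.max_handoffs:
            raise HandoffLoopDetected(f"more than {self.max_handoffs} handoffs for one task")
```

The fingerprint only catches exact repeats. In a tone tug-of-war each draft differs slightly, so the total handoff cap is the check that actually guarantees termination.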

Prompt Injection in Agentic Contexts

In a chatbot, a successful prompt injection changes one response. In an agent, it can hijack the entire goal, manipulate tool calls, and propagate malicious behavior across an orchestrated system. OWASP’s 2026 agentic applications taxonomy identifies three attack vectors: direct goal manipulation through prompt injection, indirect instruction injection hidden in documents or RAG content, and recursive hijacking where goal modifications propagate through agent reasoning chains or self-modify over time.

The late-2025 incident involving Google’s Antigravity AI coding assistant shows how much damage an agent with broad tool permissions can do. A developer asked the agent to clear a project’s cache folder. Instead, the agent wiped the user’s entire D: drive. The data was unrecoverable. The AI could diagnose exactly what had gone wrong and articulate the failure in detail. What it could not do was recover. The intelligence was there. The resilience was not. Over-permissioned tool access in agent systems represents one of the most dangerous deployment patterns in production.
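
A narrow guard at the tool boundary limits how far a bad instruction can reach. The sketch below assumes a file-deletion tool that should only ever touch paths inside one workspace directory; resolve_in_scope, OutOfScopeError, and the workspace path are hypothetical names, and nothing here describes how Antigravity or any other product is built.

```python
from pathlib import Path


class OutOfScopeError(PermissionError):
    """Raised when a requested path falls outside the agent's workspace."""


def resolve_in_scope(allowed_root: Path, requested: str) -> Path:
    """Return the resolved target path, or raise if it escapes allowed_root."""
    root = allowed_root.resolve()
    # Resolving collapses ".." segments and symlinks before the check.
    # An absolute `requested` replaces the root entirely and fails the check.
    target = (root / requested).resolve()
    if target == root or not target.is_relative_to(root):
        raise OutOfScopeError(f"{requested!r} is outside {root}")
    return target


if __name__ == "__main__":
    workspace = Path("/srv/agent-workspace")  # assumed location
    print(resolve_in_scope(workspace, "cache/build"))
    try:
        resolve_in_scope(workspace, "../../")
    except OutOfScopeError as err:
        print("blocked:", err)
```

A destructive tool would call this check before acting, so the permission decision lives in code the model cannot talk its way around. Path.is_relative_to requires Python 3.9 or later.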

Checkpointing: The Missing Infrastructure Layer

Most AI reliability work focuses on the model layer: better training, better guardrails, better benchmarks. But production agents need something else — infrastructure that can survive a failure halfway through a workflow. What happens if the process crashes at step seven of ten? What happens if a downstream service times out? What happens if a human needs to approve a step two days later? What happens if a tool call succeeds but the acknowledgment fails?

These are infrastructure questions, not model questions. Temporal’s approach frames the answer as a digital bookmark: a checkpoint that captures exactly where you are, what has already happened, and what is left to do. Recovery means resuming, not rebuilding. An agent crashes mid-tool-call and wakes up with full context of what succeeded, what failed, and where to pick up. No re-execution, no lost state, no silent corruption.
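
The bookmark idea can be illustrated without any framework. The following is a minimal sketch under stated assumptions, not Temporal's API: steps are named functions that take and return a JSON-serializable dict, and run_with_checkpoints and the checkpoint file layout are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable

Step = Callable[[dict], dict]


def _atomic_write(path: Path, data: dict) -> None:
    """Write to a temp file, then rename, so a crash never leaves half a file."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(data, handle)
    os.replace(tmp_name, path)


def run_with_checkpoints(steps: list[tuple[str, Step]], checkpoint_path: Path) -> dict:
    """Run steps in order, skipping any already recorded in the checkpoint."""
    if checkpoint_path.exists():
        saved = json.loads(checkpoint_path.read_text())
    else:
        saved = {"completed": [], "state": {}}
    for name, step in steps:
        if name in saved["completed"]:
            continue  # finished in an earlier run; do not re-execute
        saved["state"] = step(saved["state"])
        saved["completed"].append(name)
        _atomic_write(checkpoint_path, saved)  # bookmark after every step
    return saved["state"]
```

Rerunning the same call after a crash resumes at the first step not yet recorded. This sketch does not cover the case where a step succeeds but the process dies before the checkpoint is written; that step would run again, so steps with side effects need to be idempotent or carry their own deduplication key.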

METR’s research on frontier models testing real tasks of varying length found that models succeed reliably on tasks taking human experts a few minutes, but success rates drop sharply as tasks stretch to hours. The models are not less capable on longer tasks — they simply cannot hold it together across the full sequence of steps. Checkpointing is the mechanism that turns a brittle 20% success rate into a recoverable system.

What Production Systems Actually Do

The teams that ship reliable agent systems in 2026 are not the ones with the most sophisticated orchestration topologies. They are the ones who invested in observability before adding more agents. Arahi AI’s orchestration guide makes this explicit: a trace dashboard you actually use beats a fifth agent every time. The practical patterns that separate production-ready systems from pilot projects are:

  1. Pick the simplest pattern that works. A single agent running in a loop handles most tasks. Graduate to supervisor-workers only when specialist sub-tasks genuinely fix the failure modes. Most teams overshoot by one tier.
  2. Implement scoped tool access via MCP. Agents receive only the specific permissions required for their defined function — not broad system access. Schema validation catches incorrect arguments before execution.
  3. Set approval gates on day one. The first time an agent does something you wish it had not, you will want the gate already in place.
  4. Use hierarchical summarization. Every 10–20 steps, compress the working context into a structured summary retaining decision rationale, completed milestones, and current objective state. Context management is a first-class engineering concern.
  5. Define explicit retry semantics. When an agent fails mid-task, have a policy: retry the same step, re-plan from scratch, skip and continue, or escalate to a human. This is policy, not framework magic.
  6. Budget for the compound probability. If you need 95% end-to-end reliability on a 10-step workflow, you need 99.5% per-step accuracy. That number should inform every design decision.
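
The per-step figure in item 6 comes from inverting the compound formula: if the end-to-end target is p^n, the required per-step rate is the n-th root of the target. The sketch below assumes independent steps with equal reliability, and the function name required_per_step_rate is illustrative.

```python
def required_per_step_rate(target_end_to_end: float, steps: int) -> float:
    """Per-step success rate needed to hit an end-to-end target."""
    if not 0.0 < target_end_to_end <= 1.0:
        raise ValueError("target_end_to_end must be in (0, 1]")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    # Solve p ** steps == target for p.
    return target_end_to_end ** (1.0 / steps)


if __name__ == "__main__":
    for steps in (5, 10, 20):
        rate = required_per_step_rate(0.95, steps)
        print(f"{steps:>2} steps, 95% end to end: {rate:.2%} per step")
```

Running the numbers before choosing a topology shows how quickly the per-step requirement tightens as steps are added, which is the practical argument for keeping workflows short.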

The difference between a system that can reason about a problem and a system that can survive one is where AI reliability stands today. Most teams are investing heavily in the first half and largely ignoring the second.

References