AWS Lambda kills any invocation that runs longer than 900 seconds. If your research agent hits minute 15 mid-synthesis, the runtime hard-kills the execution environment, the in-memory context evaporates, and the $4.50 of compute you just spent on 40,000 tokens of scraped and summarized content becomes a 500 error to the user and a line item on your cloud bill.
This is not an edge case. It is the central architectural mismatch of 2026: production AI agents are stateful, long-running processes that routinely exceed the wall-clock limits of the serverless substrate most teams default to. The fix is not a longer timeout — it is a different execution model entirely.
The 15-Minute Wall
AWS Lambda caps execution at 15 minutes (900 seconds), with a default of 3 seconds that catches teams off guard in their first production deploy (AWS Lambda timeout documentation). Platforms like Modal support arbitrarily long-running processes with sub-second cold starts, but Lambda, the default compute choice for most serverless-first teams, imposes a hard ceiling that no configuration setting can raise.
A request/response LLM call fits comfortably inside 15 minutes. A multi-step agent does not. A research agent that scrapes twenty pages, summarizes each, cross-references claims, and drafts a report routinely runs 30 to 90 minutes. A code-review agent that pulls a diff, spins up a sandbox, runs the test suite, and writes findings can take an hour. None of these fit in Lambda’s budget.
The instinct is to raise the timeout to its maximum and move on. That works until the first production run that needs longer than the maximum allows.
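A stopgap that at least makes the wall visible is to check the remaining time budget before starting each step and save progress when it runs low. The sketch below is a minimal illustration, not a fix: `context.get_remaining_time_in_millis()` is the method the Lambda Python runtime exposes on its context object, while `run_steps`, `save_checkpoint`, the step list, and the 60-second safety margin are hypothetical names and values chosen for this example.

```python
from typing import Callable, List, Tuple

SAFETY_MARGIN_MS = 60_000  # assumed headroom for saving state before the hard kill


def run_steps(
    steps: List[Tuple[str, Callable[[], None]]],
    context,
    save_checkpoint: Callable[[int], None],
    start_at: int = 0,
) -> int:
    """Run steps in order; stop and checkpoint when the time budget runs low.

    Returns the index of the next step to run (len(steps) when finished).
    """
    for index in range(start_at, len(steps)):
        # Provided by the Lambda Python runtime on the handler's context object.
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            save_checkpoint(index)  # persist progress so a later invocation can resume
            return index
        _name, step = steps[index]
        step()
    return len(steps)
```

This only softens the limit. A single step that outlasts the remaining budget is still killed mid-flight, which is why the rest of this piece argues for a different execution model rather than better timeout bookkeeping.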
Why Agents Break the Request Model
Modern web infrastructure — REST handlers, Lambda functions, containers behind a load balancer — is optimized for short-lived, stateless operations. A request arrives, work happens, a response leaves, the process dies. State, if any, lives in a database that the next request reads back.
Agents violate this contract in three ways simultaneously (MMNTM Research):
- Duration is long: agent tasks take minutes, hours, or days, especially when waiting for human approval mid-workflow.
- State is heavy: the working context — scraped documents, intermediate summaries, tool outputs — is expensive to recompute and cannot be cheaply serialized into a session row.
- Failure is frequent: agents depend on unreliable third-party surfaces — search APIs, headless browsers, external model endpoints — that fail independently and at rates far higher than a typical database read.
Build an agent as a Python loop in a container and you have built a fragile system. Any interruption — a Kubernetes node rebalance, a deploy, a rate-limit timeout — kills the agent and the accumulated state with it.
The Restart Tax: Wasted Compute
The financial dimension of this fragility has a name: the Restart Tax. MMNTM frames it precisely: a 15-minute agent task that crashes at 99% completion wastes $4.50 in compute, because the naive retry re-executes every prior step (MMNTM Research).
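The arithmetic behind the Restart Tax can be sketched in a few lines. The numbers below are illustrative assumptions, not figures from the cited research: a hypothetical task of 100 equal steps costing $4.55 in total, crashing during the final step. The function name `wasted_compute` and its parameters are also invented for this example.

```python
from typing import List


def wasted_compute(step_costs: List[float], crashed_step: int, last_checkpoint: int = 0) -> float:
    """Cost of completed steps that must be re-executed after a crash.

    crashed_step is the index of the step that was running when the crash hit.
    last_checkpoint is the index of the first step NOT covered by a checkpoint;
    0 means no checkpoints at all, i.e. a naive retry from scratch.
    """
    return sum(step_costs[last_checkpoint:crashed_step])


# Hypothetical task: 100 equal steps, $4.55 total, crash during the last step.
step_costs = [4.55 / 100] * 100

print(f"naive retry:            ${wasted_compute(step_costs, 99):.2f}")
print(f"checkpoint every 10:    ${wasted_compute(step_costs, 99, last_checkpoint=90):.2f}")
print(f"checkpoint every step:  ${wasted_compute(step_costs, 99, last_checkpoint=99):.2f}")
```

Under these assumptions the naive retry re-spends roughly the full task cost, while per-step checkpointing re-spends nothing beyond the step that was in flight.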
In the request/response era, retries were cheap — a failed API call cost milliseconds. In the long-running-agent era, retries are prohibitively expensive. At GPU cloud rates of $5 to $15 per card-hour, an uncoordinated retry storm on a handful of failing workflows can drain a budget before the underlying bug is even identified (Spheron).
There is a second, subtler cost: duplicate side effects. When a workflow crashes mid-activity, the question is not just “do I restart?” but “did the tool call execute before the crash?” If the call landed but the acknowledgment was lost, a naive retry sends it twice. For idempotent reads this is harmless. For writes — database updates, payment calls, model training job submissions — duplicate execution corrupts state in ways that surface hours later.
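A common defense against the lost-acknowledgment case is an idempotency key: the caller derives a stable key for each logical write, and the receiving service refuses to apply the same key twice. The sketch below models both sides in memory. `PaymentService`, `idempotency_key`, and the receipt format are hypothetical and stand in for whatever downstream system actually performs the write.

```python
import hashlib
from typing import Dict, List


def idempotency_key(workflow_id: str, step_name: str) -> str:
    """Derive a stable key so every retry of the same logical step sends the same value."""
    return hashlib.sha256(f"{workflow_id}:{step_name}".encode("utf-8")).hexdigest()


class PaymentService:
    """Hypothetical downstream service that deduplicates writes by idempotency key."""

    def __init__(self) -> None:
        self._receipts: Dict[str, str] = {}  # key -> receipt already issued
        self.charges: List[int] = []         # side effects actually applied

    def charge(self, key: str, amount_cents: int) -> str:
        if key in self._receipts:
            # The earlier call landed; return its receipt instead of charging again.
            return self._receipts[key]
        self.charges.append(amount_cents)
        receipt = f"receipt-{len(self.charges)}"
        self._receipts[key] = receipt
        return receipt


service = PaymentService()
key = idempotency_key("workflow-42", "charge-card")

first = service.charge(key, 450)   # call lands, but assume the acknowledgment is lost
second = service.charge(key, 450)  # naive retry after the crash, same key
```

Here `first` and `second` are the same receipt and `service.charges` holds a single entry. The deduplication has to live on the receiving side, or in a store shared with it; a check kept only in the crashed process's memory would vanish along with the process.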
Durable Execution: Three Engines
The architectural answer is durable execution: a class of workflow engine that journals every step so the agent can resume from the exact point of failure, regardless of what crashed. Three engines dominate production use in 2026, each with a distinct journaling model (Spheron):
| Engine | Journal model | Strength | Trade-off |
|---|---|---|---|
| Temporal | History-based event replay | Mature, battle-tested at Netflix scale | Operational complexity, long-running workers |
| Inngest | Step-level checkpointing | Developer ergonomics, serverless-native | Newer ecosystem, fewer large deployments |
| Restate | Virtual objects, journaled invocations | Exactly-once semantics, embedded runtime | Smallest community of the three |
Temporal records a full event history — activity scheduled, started, completed — and replays the workflow code deterministically on recovery, returning cached results for completed steps without re-invoking the underlying APIs. Netflix runs hundreds of thousands of Temporal workflows per day (MMNTM Research).
Inngest checkpoints the output of each step.run() call and re-invokes the function on failure, replaying cached step results. Restate assigns each virtual object a unique key (such as a session ID), serializes all handler calls on that object, and journals the result of every ctx.run() block, so a retry replays the recorded result instead of re-executing the step. That journal is the basis of its exactly-once semantics across crashes.
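The idea these engines share, recording each step's result and replaying it on re-invocation, fits in a short sketch. This is a toy in-memory model written for illustration; it is not the Temporal, Inngest, or Restate API, and `StepJournal`, `research_workflow`, and the step functions are all hypothetical names.

```python
from typing import Any, Callable, Dict, List


class StepJournal:
    """Records each step's output so a re-invoked workflow skips finished steps."""

    def __init__(self) -> None:
        self.results: Dict[str, Any] = {}

    def run(self, step_id: str, fn: Callable[[], Any]) -> Any:
        if step_id in self.results:
            return self.results[step_id]  # replay: return the recorded result
        result = fn()
        self.results[step_id] = result    # journal the result after the step completes
        return result


executed: List[str] = []  # which steps actually ran, across all attempts


def scrape() -> List[str]:
    executed.append("scrape")
    return ["page-1", "page-2"]


def summarize(pages: List[str]) -> str:
    executed.append("summarize")
    return f"{len(pages)} pages summarized"


def draft(summary: str) -> str:
    executed.append("draft")
    return f"report: {summary}"


def research_workflow(journal: StepJournal, crash_before_draft: bool) -> str:
    pages = journal.run("scrape", scrape)
    summary = journal.run("summarize", lambda: summarize(pages))
    if crash_before_draft:
        raise RuntimeError("simulated crash before the draft step")
    return journal.run("draft", lambda: draft(summary))


journal = StepJournal()
try:
    research_workflow(journal, crash_before_draft=True)   # first attempt dies mid-workflow
except RuntimeError:
    pass
report = research_workflow(journal, crash_before_draft=False)  # retry replays two steps
```

After the retry, `executed` contains each step exactly once: the scrape and summarize results came from the journal rather than from a second execution. A production engine additionally persists the journal outside the process and requires the workflow code to issue the same steps in the same order on replay.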
DBOS takes a different cut: it persists workflow state directly in Postgres, using the database as the durable substrate rather than a separate journal service (DBOS). For teams already operating Postgres at scale, this eliminates an additional distributed system to run.
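The database-as-journal idea can be illustrated with a single table keyed by workflow and step. The sketch below is not the DBOS API, and it uses SQLite from the Python standard library as a stand-in for Postgres so that it runs without a server. The table name, `open_store`, and `durable_step` are hypothetical.

```python
import json
import sqlite3
from typing import Any, Callable


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) the step-result table; SQLite stands in for Postgres here."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS step_results ("
        "workflow_id TEXT NOT NULL, step_id TEXT NOT NULL, result TEXT NOT NULL, "
        "PRIMARY KEY (workflow_id, step_id))"
    )
    conn.commit()
    return conn


def durable_step(
    conn: sqlite3.Connection, workflow_id: str, step_id: str, fn: Callable[[], Any]
) -> Any:
    """Return the stored result for this step if one exists; otherwise run and store it."""
    row = conn.execute(
        "SELECT result FROM step_results WHERE workflow_id = ? AND step_id = ?",
        (workflow_id, step_id),
    ).fetchone()
    if row is not None:
        return json.loads(row[0])  # a previous process already completed this step
    result = fn()
    conn.execute(
        "INSERT INTO step_results (workflow_id, step_id, result) VALUES (?, ?, ?)",
        (workflow_id, step_id, json.dumps(result)),
    )
    conn.commit()  # the step counts as done only once this commit succeeds
    return result
```

Because the results live in a database file rather than in process memory, a fresh process that opens the same store and calls `durable_step` with the same workflow and step IDs gets the stored result back without re-running the function. Results must be JSON-serializable in this sketch.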
Serverless Workers: The 2026 Shift
Temporal’s traditional weakness was operational: it required always-on Worker processes that continuously polled task queues, burning compute even when idle. For bursty, event-driven agent workloads, that meant sizing infrastructure for peak load and paying for it around the clock.
At Replay 2026, Temporal closed this gap. Serverless Workers — now in pre-release — deploy Worker logic directly to AWS Lambda, with Temporal Cloud handling invocation, autoscaling, and scale-to-zero (byteiota). When the task queue is empty, nothing runs. When a burst arrives, Lambda scales to match. When each invocation finishes, it shuts down.
The cold-start objection dissolves on inspection. Lambda takes 200 to 500 milliseconds to initialize a new execution environment. A single LLM call takes one to thirty seconds. For multi-step agent workflows running for minutes, the cold start is statistical noise (byteiota).
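The proportion is easy to check. The sketch below takes the upper cold-start figure quoted above (0.5 seconds) and compares it against a hypothetical workflow of 20 LLM calls at 10 seconds each. The workflow shape and the assumption of a single cold start per workflow are illustrative, not measured.

```python
def cold_start_share(cold_start_s: float, llm_call_s: float, llm_calls: int) -> float:
    """Fraction of total wall-clock time spent on one cold start."""
    total = cold_start_s + llm_call_s * llm_calls
    return cold_start_s / total


# Hypothetical 20-call workflow at 10 s per call, with a 0.5 s cold start.
print(f"multi-step workflow: {cold_start_share(0.5, 10.0, 20):.2%}")

# Contrast: a single fast 1 s call, where the same cold start is a large share.
print(f"single short call:   {cold_start_share(0.5, 1.0, 1):.2%}")
```

For the multi-step case the cold start is about a quarter of one percent of the runtime. For a single one-second call it is a third, which is why the argument applies to multi-minute agent workflows and not to every workload.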
A second Replay 2026 release, Workflow Streams, pushes token batches and status updates back to the caller while a workflow runs, using Temporal’s Signal and Update primitives. This closes the last gap: streaming AI responses through a durable orchestration layer without breaking the execution guarantee.
When You Don’t Need Durability
Durable execution is not free. It adds a coordination service to operate, a journal to store, and determinism constraints on workflow code. Not every workload justifies it.
A single-shot LLM call with no tool use does not need it. A stateless function-call proxy does not need it. A batch job that can be safely re-run from scratch does not need it. The decision point is simple: if a crash mid-execution would force you to re-spend money, re-execute side effects, or lose irrecoverable state, you need durable execution. If a retry is effectively free, you do not.
The teams getting this wrong in 2026 are not the ones skipping durability — they are the ones applying it to trivially restartable work and then wondering why their infra bill doubled. Match the execution model to the actual cost of failure.
The Hard Truth
Serverless was built for stateless request handling. Agents are stateful, long-running, and failure-prone. Pretending the second fits inside the first is how you ship an agent that works in the demo and burns money in production. The infrastructure to solve this exists — Temporal, Inngest, Restate, DBOS — and as of 2026 it no longer requires always-on servers to run. The remaining excuse for shipping fragile agents is gone.