Patch Changes
-
#1826
1bbd9bc Thanks @threepointone! - Add a tight, OOM-specific retry budget to chat recovery so a memory-limit crash loop seals fast and attributably (#1825).

When a recovery turn hits a Durable Object memory-limit reset (the isolate exceeded its 128 MB limit), recovery now classifies it as a distinct, deterministic failure rather than a deploy-style transient. A memory reset re-OOMs on re-run (the turn's working set, not the platform, is the cause), so it must NOT be deferred and retried forever like a code-update/connection-lost transient. Each such crash bumps a durable per-incident oomAttempts counter; recovery retries a small number of times (new chatRecovery.maxOomRetries, default 3), in case the OOM was a transient spike, then seals with reason="out_of_memory". This is far tighter than the generic maxRecoveryWork backstop because an OOM is attributable and each re-run re-runs the model.

This complements the finite maxRecoveryWork default: the OOM budget is the fast path for memory resets that surface as catchable errors thrown from recovery bookkeeping (e.g. storage/SQL rejections after the reset), while maxRecoveryWork remains a backstop for the hard-kill case where no in-isolate code runs to record the OOM.

Adds an alarm-boundary circuit breaker (agents) as the universal backstop for the case the in-DO budgets can't catch (#1825): a memory-limit reset that bypasses them entirely, either thrown before the budget code runs (e.g. boot-time state hydration OOMs) or one whose own small writes also OOM under memory pressure. Left unhandled, such an error propagates out of alarm() and the platform auto-retries the alarm forever, re-running the doomed, billable turn each cycle. Agent.alarm() now intercepts ONLY Durable Object memory-limit resets at the outermost frame, where the heavy turn has unwound and GC has reclaimed its footprint, so the seal/purge writes can land where mid-turn ones OOMed. A durable strike counter tolerates a few resets (new static options.maxAlarmMemoryLimitStrikes, default 3), backing off the looping rows so the retry is not a hot loop, then seals the recovery (out_of_memory) and surgically purges only the looping schedule rows, leaving unrelated scheduled tasks intact. A new alarm:memory_limit_reset observability event is emitted. Everything except memory-limit resets re-throws exactly as before.

Also broadens and exports the isDurableObjectMemoryLimitReset(error) predicate from agents (a sibling to isDurableObjectCodeUpdateReset / isPlatformTransientError): it now matches the shared "exceeded its memory limit" fragment so truncated/reworded surfacings (observed in real #1825 logs) still classify.
-
#1826
1bbd9bc Thanks @threepointone! - Fix never-ending chat-recovery retries when a Durable Object isolate runs out of memory mid-turn (#1825).

chatRecovery.maxRecoveryWork now defaults to a generous finite backstop (1000) instead of Infinity. An isolate that exceeds its memory limit and is reset mid-stream has usually already streamed a little content, which bumps the durable progress counter. On the next wake, recovery reads that as forward progress and resets both progress-keyed bounds: the attempt cap (maxAttempts) and the no-progress window (noProgressTimeoutMs). Because each crash lands inside the alarm-debounce window, the attempt counter is pinned too. With the work budget disabled (Infinity), no instrument could ever seal the turn, so recovery re-ran the turn (and its LLM calls) forever. The work meter is the one signal that keeps climbing across such a loop, so a finite default seals a runaway with reason="work_budget_exceeded" instead of looping.

Work only accrues from the first interruption until the turn completes, so a normal interrupted turn never approaches the cap. For a very long agentic turn that legitimately produces a large amount of content under heavy interruption, raise maxRecoveryWork (or set it to Infinity to restore the previous fully-unbounded behavior, ideally paired with a shouldKeepRecovering predicate that bounds the runaway via real token/cost accounting).
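
To make the interplay of the two budgets concrete, here is a minimal, self-contained sketch of the sealing decision described in these notes. It is an illustrative model, not the agents implementation: the IncidentState shape, the onRecoveryFailure function, the looksLikeMemoryLimitReset stand-in for the exported predicate, the unit of "work", and the exact boundary (seal once oomAttempts exceeds maxOomRetries) are all assumptions. Only the option names, their defaults (3 and 1000), the seal reasons, and the matched message fragment come from the notes above.

```typescript
// Illustrative model only; NOT the agents SDK implementation.
type SealReason = "out_of_memory" | "work_budget_exceeded";

type Decision =
  | { action: "retry" }
  | { action: "seal"; reason: SealReason };

// Option names and defaults are taken from the release notes.
interface RecoveryBudgets {
  maxOomRetries: number;
  maxRecoveryWork: number;
}

const DEFAULT_BUDGETS: RecoveryBudgets = {
  maxOomRetries: 3,
  maxRecoveryWork: 1000,
};

// Hypothetical per-incident durable state.
interface IncidentState {
  oomAttempts: number; // bumped on every memory-limit reset
  work: number; // keeps climbing across a crash loop (unit is assumed)
}

// Local stand-in for isDurableObjectMemoryLimitReset: matches the shared
// fragment so truncated or reworded messages still classify.
function looksLikeMemoryLimitReset(error: unknown): boolean {
  const message = error instanceof Error ? error.message : String(error);
  return message.includes("exceeded its memory limit");
}

// Decide what to do after a recovery turn fails.
function onRecoveryFailure(
  state: IncidentState,
  error: unknown,
  workDone: number,
  budgets: RecoveryBudgets = DEFAULT_BUDGETS,
): Decision {
  state.work += workDone;

  // Fast path: an attributable OOM gets a small, dedicated retry budget.
  if (looksLikeMemoryLimitReset(error)) {
    state.oomAttempts += 1;
    if (state.oomAttempts > budgets.maxOomRetries) {
      return { action: "seal", reason: "out_of_memory" };
    }
  }

  // Backstop: the work meter seals any loop the other bounds miss.
  if (state.work > budgets.maxRecoveryWork) {
    return { action: "seal", reason: "work_budget_exceeded" };
  }

  return { action: "retry" };
}

// Usage: four consecutive OOM crashes against the default budgets.
const state: IncidentState = { oomAttempts: 0, work: 0 };
const oom = new Error("Durable Object exceeded its memory limit and was reset.");
const decisions = [1, 2, 3, 4].map(() => onRecoveryFailure(state, oom, 10).action);
console.log(decisions.join(", ")); // retry, retry, retry, seal
```

In this model the OOM budget seals after a handful of crashes, long before the work meter would reach 1000, which is the "fast path versus backstop" relationship the notes describe. Setting maxRecoveryWork to Infinity disables only the second check.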