Patch Changes
-
#1684
ab6dd95 Thanks @threepointone! - warn when chatRecovery is configured in onStart() (applied too late for wake recovery)

On every Durable Object wake the SDK evaluates chat-recovery budgets, and may seal an interrupted turn and fire onExhausted, before the user's onStart() runs (_checkRunFibers() is ordered ahead of onStart()). A chatRecovery config produced inside onStart() is therefore read as the built-in defaults at the moment recovery decides, so a configured maxRecoveryWork/shouldKeepRecovering/onExhausted silently never applies to the recovery that matters.

This is now documented on ChatRecoveryConfig and the chatRecovery fields of Think/AIChatAgent, and the SDK logs a one-time warning if it detects chatRecovery being reassigned during onStart(). The warning fires both for a custom config object and for chatRecovery = true (enabling recovery / its defaults too late); assigning false (disabling) in onStart() is intentionally not warned, since recovery already ran with the pre-onStart() value and disabling it afterward is a benign no-op for that wake. The fix is to assign chatRecovery as a class field or in the constructor.
-
#1684
ab6dd95 Thanks @threepointone! - fix(chat-recovery): don't seal a human-in-the-loop turn that is waiting on a pending client tool call

A turn parked on a pending CLIENT interaction (an input-available client-tool part with no server execute, or an approval-requested part, as detected by hasPendingInteraction()) is waiting on the human, not stuck. After a mid-turn Durable Object restart (e.g. a deploy), the in-memory pending-interaction promise is gone, so waitUntilStable() repeatedly times out until the client reconnects and replays the tool-result/approval. That replay drives a fresh continuation via the auto-continuation barrier independently of recovery, but the recovery loop was treating those timeouts as deploy churn:
- each stable-state timeout burned a recovery attempt, eventually sealing a perfectly healthy turn with reason="stable_timeout", and
- the no-progress window (which never advances while no content is produced) could seal it with reason="no_progress_timeout" once it elapsed.
The net effect: an interrupted human-in-the-loop turn whose user simply took longer than the configured noProgressTimeoutMs / attempt budget to answer a tool prompt was terminalized with a "session interrupted" banner, even though nothing had actually failed.

While a client interaction is pending the turn is now budget-free: _beginChatRecoveryIncident suppresses the no-progress window, attempt cap, work budget, and shouldKeepRecovering predicate, and keeps the no-progress clock fresh so the turn gets a full window once the human finally answers. _chatRecoveryContinue / _chatRecoveryRetry park (mark the incident skipped with reason="awaiting_client_interaction", resolving the live "recovering…" indicator) instead of rescheduling or exhausting; the client's eventual replay resumes the turn. A client that never returns is reclaimed by the incident TTL sweep and DO idle-eviction.
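The budget-free behavior described above can be pictured as a small decision function. This is a minimal sketch under stated assumptions, not the SDK's implementation: `decideRecoveryStep`, `IncidentSketch`, `BudgetSketch`, and the `Decision` shape are hypothetical names, the work budget and shouldKeepRecovering predicate are left out for brevity, and mapping an exhausted attempt cap to `stable_timeout` is a simplification. Only the reason strings come from the notes.

```typescript
// Hypothetical, simplified model of the park-versus-seal decision; not the SDK's code.
type SealReason = "stable_timeout" | "no_progress_timeout";

type Decision =
  | { action: "park"; reason: "awaiting_client_interaction" }
  | { action: "seal"; reason: SealReason }
  | { action: "retry" };

interface IncidentSketch {
  attempts: number; // recovery attempts burned so far
  lastProgressAt: number; // ms timestamp of the last produced content
}

interface BudgetSketch {
  maxAttempts: number;
  noProgressTimeoutMs: number;
}

function decideRecoveryStep(
  incident: IncidentSketch,
  budget: BudgetSketch,
  pendingClientInteraction: boolean,
  now: number
): Decision {
  if (pendingClientInteraction) {
    // Budget-free: no cap is evaluated, and the no-progress clock is kept
    // fresh so the turn gets a full window once the human answers.
    incident.lastProgressAt = now;
    return { action: "park", reason: "awaiting_client_interaction" };
  }
  if (now - incident.lastProgressAt >= budget.noProgressTimeoutMs) {
    return { action: "seal", reason: "no_progress_timeout" };
  }
  if (incident.attempts >= budget.maxAttempts) {
    return { action: "seal", reason: "stable_timeout" };
  }
  incident.attempts += 1; // this attempt counts against the cap
  return { action: "retry" };
}
```

In this model a parked turn never burns an attempt, which is the property the fix is after: the human's response time no longer counts against the recovery budget.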
In @cloudflare/think, a submitMessages-backed turn additionally has its durable submission row completed at park time. The recovery loop is that row's sole completion driver after a restart, and the client's replay resumes the conversation as an independent auto-continuation that never touches the submission, so parking without completing would leave the row running, and the next restart's _recoverSubmissionsOnStart would sweep it to error (a false "session recovery error"). The park condition is a fully-materialized client tool call in the leaf, which is exactly the terminal state a non-interrupted submission reaches when its step emits a client tool call (the model does not block on client tools), so completed is the correct, consistent outcome.

SERVER-tool orphans are deliberately excluded (their execute() died with the isolate and nothing will resolve them), so they still recover normally via the transcript-repair pass.

Both @cloudflare/think and @cloudflare/ai-chat (which carries its own copy of the recovery engine) are fixed. In @cloudflare/think the client/server distinction already lived in hasPendingInteraction(). @cloudflare/ai-chat's hasPendingInteraction() (used by waitUntilStable) does not distinguish client from server tools, so a new, narrower client-only predicate hasPendingClientInteraction() was added there and gates the exemption, leaving waitUntilStable's existing behavior untouched so server-tool orphans keep reschedule/exhaust semantics.

The exemption depends on knowing the request's client tools. @cloudflare/ai-chat restores them in its constructor, so they are available when boot recovery evaluates budgets. @cloudflare/think restored them in onStart(), which the base Agent runs after the boot-recovery path (_handleInternalFiberRecovery -> _beginChatRecoveryIncident), so on a fresh wake the in-memory cache was still empty and a client-tool input-available orphan re-detected past the no-progress window was misread as "stuck" and wrongly sealed. _beginChatRecoveryIncident now re-hydrates _lastClientTools from the durable think_config store before evaluating the budget, closing that hibernation-ordering hole (approval-requested turns were never affected, since that branch does not depend on the client tool set).
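To illustrate why the exemption depends on the client tool set, here is a rough sketch of a client-only predicate. The part shape, the state names other than input-available and approval-requested, and the function name `hasPendingClientInteractionSketch` are assumptions made for this example; the real predicate and message types live in the SDK and may differ.

```typescript
// Hypothetical shapes; the real message-part and tool types live in the SDK.
type ToolPartState =
  | "input-streaming"
  | "input-available"
  | "approval-requested"
  | "output-available";

interface ToolPartSketch {
  toolName: string;
  state: ToolPartState;
}

// Client-only predicate: true when the turn is waiting on the human.
function hasPendingClientInteractionSketch(
  parts: ToolPartSketch[],
  clientToolNames: ReadonlySet<string>
): boolean {
  return parts.some((part) => {
    // Approval requests do not depend on the client tool set.
    if (part.state === "approval-requested") return true;
    // An input-available part only counts when the tool is a known client tool;
    // a server tool in this state is an orphan that recovery must still repair.
    return part.state === "input-available" && clientToolNames.has(part.toolName);
  });
}
```

With an empty `clientToolNames` set (the state of the in-memory cache on a fresh wake before re-hydration), an input-available client-tool part is not recognized, which mirrors the hibernation-ordering hole described above.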
-
#1672
f96a2ba Thanks @threepointone! - fix(chat-recovery): a turn making forward progress now survives unbounded deploy churn; add a work budget + shouldKeepRecovering runaway guard

Durable chat recovery used to bound a single incident with a non-resetting 15-minute wall-clock ceiling (CHAT_RECOVERY_MAX_WINDOW_MS). That ceiling was overloaded: it served as both a recovery-duration bound and a runaway-loop guard, and it terminated healthy, actively-progressing turns that simply took longer than 15 minutes of wall-clock time to finish while being repeatedly interrupted by a dense deploy window, sealing them with reason="max_recovery_window_exceeded" and discarding completed work.

The two jobs are now decoupled (see design/rfc-chat-recovery-work-budget.md):
- Duration is no longer a bound for a progressing turn. The non-resetting wall-clock ceiling is removed. A turn that keeps producing content survives unbounded deploy churn. Stuck turns are still sealed by the no-progress window (5 min, resets on progress); tight no-progress alarm loops are still sealed by the attempt cap.
- New runaway-loop guard, keyed to work, not time. The existing durable, monotonic, reconnect-immune progress counter is reused as a work meter. chatRecovery.maxRecoveryWork caps the produced content/tool units since an incident opened; exceeding it seals with reason="work_budget_exceeded". It defaults to Infinity: the SDK ships the mechanism but imposes no implicit cap, so it never terminates a progressing turn on its own.
- New caller predicate. chatRecovery.shouldKeepRecovering(ctx) is consulted per recovery attempt from the second onward (only when no hard bound has already sealed the incident); returning false seals with reason="recovery_aborted". This is where integrators express token/cost/step budgets the SDK should not hardcode. A throwing predicate is logged and treated as "keep recovering".
- The no-progress timeout is now configurable. chatRecovery.noProgressTimeoutMs (default 5 min, resets on progress) is the primary stuck-turn bound, now overridable per agent instead of a hardcoded constant.
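The interaction of the three bounds above can be sketched as follows. This is an illustrative model only: `RecoveryConfigSketch` and `ProgressContextSketch` are hypothetical stand-ins for ChatRecoveryConfig and ChatRecoveryProgressContext (whose real fields are not listed in these notes), the attempt cap is omitted, and checking the no-progress window before the work budget is an assumption. The field names, defaults, and reason strings are taken from the notes.

```typescript
// Hypothetical stand-ins; only the three config field names, their defaults,
// and the reason strings come from the release notes.
interface ProgressContextSketch {
  attempt: number; // 1-based recovery attempt
  workUnits: number; // content/tool units produced since the incident opened
  msSinceLastProgress: number;
}

interface RecoveryConfigSketch {
  maxRecoveryWork?: number;
  noProgressTimeoutMs?: number;
  shouldKeepRecovering?: (ctx: ProgressContextSketch) => boolean;
}

type RecoveryVerdict =
  | "continue"
  | "no_progress_timeout"
  | "work_budget_exceeded"
  | "recovery_aborted";

function evaluateRecovery(
  config: RecoveryConfigSketch,
  ctx: ProgressContextSketch
): RecoveryVerdict {
  const noProgressTimeoutMs = config.noProgressTimeoutMs ?? 5 * 60_000;
  const maxRecoveryWork = config.maxRecoveryWork ?? Infinity;

  // Hard bounds first (their relative order here is an assumption).
  if (ctx.msSinceLastProgress >= noProgressTimeoutMs) return "no_progress_timeout";
  if (ctx.workUnits > maxRecoveryWork) return "work_budget_exceeded";

  // The caller predicate is consulted from the second attempt onward.
  if (ctx.attempt >= 2 && config.shouldKeepRecovering) {
    try {
      if (config.shouldKeepRecovering(ctx) === false) return "recovery_aborted";
    } catch {
      // A throwing predicate is treated as "keep recovering".
    }
  }
  return "continue";
}

// Example config: cap an incident at 500 work units and give up after 10 attempts.
const exampleRecoveryConfig: RecoveryConfigSketch = {
  maxRecoveryWork: 500,
  noProgressTimeoutMs: 2 * 60_000,
  shouldKeepRecovering: (ctx) => ctx.attempt <= 10,
};
```

With an empty config the sketch never seals a turn that keeps making progress, matching the stated default of no implicit cap.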
New public types from agents/chat: ChatRecoveryProgressContext. New ChatRecoveryConfig fields: maxRecoveryWork, shouldKeepRecovering, noProgressTimeoutMs. ChatRecoveryExhaustedContext.reason gains work_budget_exceeded and recovery_aborted; max_recovery_window_exceeded is retained as an open-string value but is no longer emitted.

Both @cloudflare/ai-chat and @cloudflare/think (which carries its own copy of the recovery engine) are updated identically. Defaults are unchanged except that a progressing turn is no longer terminated by wall-clock age.
-
#1670
5d64940 Thanks @threepointone! - Fix: a deploy that interrupts an in-flight runAgentTool child no longer abandons the still-running child as interrupted.

Parent recovery re-attaches to a still-running child and tails it to its real terminal. Previously that re-attach used a flat 120s wall-clock budget that was not reset by the child's forward progress, so a healthy child whose recovery legitimately ran longer than the budget was sealed interrupted (and its already-completed work re-run from scratch), even while it was actively streaming.

The re-attach budget is now progress-keyed: it bounds how long the parent waits with no forward progress from the child (resetting on every forwarded chunk), so a genuinely hung/silent child still seals interrupted after one no-progress window and can never block recovery forever, while a healthy child that keeps streaming is followed through to terminal. The parent re-arms (opens a fresh tail) only when the child's stream closes cleanly while it is still advancing, i.e. a re-evicted-but-progressing child. A full no-progress window (the child went silent) seals no-progress immediately even if the child streamed earlier in that window; it no longer grants a bonus window. This is both the honest stall signal and what keeps at most one pending tail reader alive per re-attach (no per-cycle reader accumulation).

@cloudflare/think and @cloudflare/ai-chat additionally finalize a child facet's own agent-tool run row as soon as its recovered turn settles, regardless of whether recovery took the continue path (_chatRecoveryContinue) or the pre-stream retry path (_chatRecoveryRetry), so a re-attached parent collects the terminal result immediately instead of waiting out a full no-progress window after the child has already finished.

This release also adds:
- Typed interrupted cause. RunAgentToolResult, the agentTool() AgentToolFailure envelope, the onAgentToolFinish lifecycle result, and the agent-tool-event wire event (kind "interrupted") now carry a machine-readable reason (AgentToolInterruptedReason: "no-progress" | "window-exceeded" | "not-tailable" | "inspect-timeout" | "inspect-failed" | "recovery-deadline") and a childStillRunning boolean on interrupted results, so callers (and UIs) can branch on why a run was abandoned (and whether the child is still running) instead of pattern-matching the human-readable error prose. retryable stays coarse (always true for interrupted); refine with reason/childStillRunning. These fields are persisted (schema bump), so they survive a reconnect replay: a client that reconnects after an interrupt reconstructs the same reason/childStillRunning a live client saw, rather than undefined. The persisted cause is cleared when a soft interrupted row is later repaired to completed/error.
- Configurable re-attach budgets. Two new public AgentStaticOptions, agentToolReattachNoProgressTimeoutMs (default 120000, the progress-keyed no-progress budget) and agentToolReattachMaxWindowMs (default Infinity, i.e. no implicit wall-clock cap), let an Agent tune re-attach. The hard ceiling defaults to uncapped to mirror chat-recovery's maxRecoveryWork: Infinity default. A re-attached parent follows a healthy, still-advancing child for as long as it makes progress, exactly as it would on the live (never-evicted) path, so it never abandons a long-running-but-healthy child that simply outlasts a fixed wall clock under deploy churn. A hung/silent child is bounded by the no-progress budget; a content-runaway is bounded uniformly (live and recovery) by the child's own maxRecoveryWork/shouldKeepRecovering. Integrators that want a hard wall-clock cap (and the window-exceeded child teardown it triggers) can set agentToolReattachMaxWindowMs to a finite value. Symmetrically, setting agentToolReattachNoProgressTimeoutMs to Infinity now means "never seal on no-progress" (a silent-but-alive child is followed until its stream closes or the hard ceiling fires) instead of silently skipping the wait; 0 remains the "don't wait, collect only an already-terminal child" sentinel.
- Give-up teardown (ceiling only). When the parent gives up at the hard window-exceeded ceiling, where the child has had its full recovery window and is truly exhausted, it now cancels the child (childStillRunning: false) so it stops consuming a fiber / keep-alive. no-progress give-ups stay soft (childStillRunning: true): the child is left running so a re-issue can still re-attach and repair it if it self-heals, preserving the repair-on-re-issue path. In both @cloudflare/think and @cloudflare/ai-chat, cancelAgentToolRun also aborts an in-flight chat-recovery turn (not just the original in-isolate run) and releases live tails: Think sweeps its _submissionAbortControllers, ai-chat its request AbortRegistry (abortAllRequests). A torn-down child therefore stops grinding instead of finishing an orphaned recovered turn.
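As a rough illustration of the typed cause and the budget sentinels described in this list, a caller might branch like this. The reason union is copied from the notes; `InterruptedResultSketch`, `planAfterInterrupt`, `reattachWaitMode`, and the follow-up policy they encode are hypothetical and only show one way to use the fields.

```typescript
// The reason union is copied from the release notes; the result shape around it
// is a hypothetical simplification of an interrupted agent-tool result.
type AgentToolInterruptedReason =
  | "no-progress"
  | "window-exceeded"
  | "not-tailable"
  | "inspect-timeout"
  | "inspect-failed"
  | "recovery-deadline";

interface InterruptedResultSketch {
  status: "interrupted";
  error: string; // human-readable prose, not meant for branching
  retryable: true; // stays coarse for interrupted results
  reason?: AgentToolInterruptedReason;
  childStillRunning?: boolean;
}

type NextStep = "reissue-to-reattach" | "start-fresh" | "retry-later";

// Illustrative policy only: branch on the typed fields, never on the error text.
function planAfterInterrupt(result: InterruptedResultSketch): NextStep {
  // Soft give-up: the child was left running, so a re-issue can re-attach.
  if (result.childStillRunning) return "reissue-to-reattach";
  // Ceiling give-up: the child was cancelled, so nothing is left to re-attach to.
  if (result.reason === "window-exceeded") return "start-fresh";
  return "retry-later";
}

type ReattachWaitMode = "collect-only" | "never-seal" | "bounded";

// Sentinel meanings of agentToolReattachNoProgressTimeoutMs as described above.
function reattachWaitMode(noProgressTimeoutMs: number): ReattachWaitMode {
  if (noProgressTimeoutMs === 0) return "collect-only";
  if (noProgressTimeoutMs === Infinity) return "never-seal";
  return "bounded";
}
```

Keeping the branch on `reason` and `childStillRunning` means a client that reconnects and replays a persisted interrupt takes the same path a live client would.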