Patch Changes
-
#1684
ab6dd95 Thanks @threepointone! - warn when chatRecovery is configured in onStart() (applied too late for wake recovery)

On every Durable Object wake the SDK evaluates chat-recovery budgets, and may seal an interrupted turn (firing onExhausted), before the user's onStart() runs (_checkRunFibers() is ordered ahead of onStart()). A chatRecovery config produced inside onStart() is therefore read as the built-in defaults at the moment recovery decides, so a configured maxRecoveryWork / shouldKeepRecovering / onExhausted silently never applies to the recovery that matters.

This is now documented on ChatRecoveryConfig and the chatRecovery fields of Think / AIChatAgent, and the SDK logs a one-time warning if it detects chatRecovery being reassigned during onStart(). The warning fires both for a custom config object and for chatRecovery = true (enabling recovery / its defaults too late); assigning false (disabling) in onStart() is intentionally not warned, since recovery already ran with the pre-onStart() value and disabling it afterward is a benign no-op for that wake. The fix is to assign chatRecovery as a class field or in the constructor.
-
#1684
ab6dd95 Thanks @threepointone! - fix(chat-recovery): don't seal a human-in-the-loop turn that is waiting on a pending client tool call

A turn parked on a pending CLIENT interaction (an input-available client-tool part with no server execute, or an approval-requested part, as detected by hasPendingInteraction()) is waiting on the human, not stuck. After a mid-turn Durable Object restart (e.g. a deploy), the in-memory pending-interaction promise is gone, so waitUntilStable() repeatedly times out until the client reconnects and replays the tool-result/approval. That replay drives a fresh continuation via the auto-continuation barrier independently of recovery, but the recovery loop was treating those timeouts as deploy churn:

- each stable-state timeout burned a recovery attempt, eventually sealing a perfectly healthy turn with reason="stable_timeout", and
- the no-progress window (which never advances while no content is produced) could seal it with reason="no_progress_timeout" once it elapsed.
The net effect: an interrupted human-in-the-loop turn whose user simply took longer than the configured noProgressTimeoutMs / attempt budget to answer a tool prompt was terminalized with a "session interrupted" banner, even though nothing had actually failed.

While a client interaction is pending the turn is now budget-free:

- _beginChatRecoveryIncident suppresses the no-progress window, attempt cap, work budget, and shouldKeepRecovering predicate, and keeps the no-progress clock fresh so the turn gets a full window once the human finally answers.
- _chatRecoveryContinue / _chatRecoveryRetry park (mark the incident skipped with reason="awaiting_client_interaction", resolving the live "recovering…" indicator) instead of rescheduling or exhausting; the client's eventual replay resumes the turn. A client that never returns is reclaimed by the incident TTL sweep and DO idle-eviction.
In @cloudflare/think, a submitMessages-backed turn additionally has its durable submission row completed at park time. The recovery loop is that row's sole completion driver after a restart, and the client's replay resumes the conversation as an independent auto-continuation that never touches the submission, so parking without completing would leave the row running, and the next restart's _recoverSubmissionsOnStart would sweep it to error (a false "session recovery error"). The park condition is a fully-materialized client tool call in the leaf, which is exactly the terminal state a non-interrupted submission reaches when its step emits a client tool call (the model does not block on client tools), so completed is the correct, consistent outcome.

SERVER-tool orphans are deliberately excluded (their execute() died with the isolate and nothing will resolve them), so they still recover normally via the transcript-repair pass.

Both @cloudflare/think and @cloudflare/ai-chat (which carries its own copy of the recovery engine) are fixed. In @cloudflare/think the client/server distinction already lived in hasPendingInteraction(). @cloudflare/ai-chat's hasPendingInteraction() (used by waitUntilStable) does not distinguish client from server tools, so a new, narrower client-only predicate hasPendingClientInteraction() was added there and gates the exemption, leaving waitUntilStable's existing behavior untouched so server-tool orphans keep reschedule/exhaust semantics.

The exemption depends on knowing the request's client tools. @cloudflare/ai-chat restores them in its constructor, so they are available when boot recovery evaluates budgets. @cloudflare/think restored them in onStart(), which the base Agent runs after the boot-recovery path (_handleInternalFiberRecovery -> _beginChatRecoveryIncident), so on a fresh wake the in-memory cache was still empty and a client-tool input-available orphan re-detected past the no-progress window was misread as "stuck" and wrongly sealed. _beginChatRecoveryIncident now re-hydrates _lastClientTools from the durable think_config store before evaluating the budget, closing that hibernation-ordering hole (approval-requested turns were never affected, since that branch does not depend on the client tool set).
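The client-only exemption described above can be sketched as a small predicate. This is an illustrative model, not the SDK's implementation: the `ToolPartSketch` shape, `hasPendingClientInteractionSketch`, and `decideRecoverySketch` are hypothetical names, and real message parts carry more fields.

```typescript
// Hypothetical, simplified part shape; the real SDK message parts are richer.
type ToolPartStateSketch =
  | "input-available"
  | "approval-requested"
  | "output-available";

interface ToolPartSketch {
  toolName: string;
  state: ToolPartStateSketch;
}

// An approval request always waits on the human. An input-available part only
// waits on the human when the tool is a client tool (no server-side execute),
// which is why the set of client tool names must be known at decision time.
function hasPendingClientInteractionSketch(
  parts: ToolPartSketch[],
  clientToolNames: Set<string>
): boolean {
  return parts.some(
    (part) =>
      part.state === "approval-requested" ||
      (part.state === "input-available" && clientToolNames.has(part.toolName))
  );
}

// "park" means: skip budgets and wait for the client's replay.
// "recover" means: normal reschedule/exhaust semantics apply.
function decideRecoverySketch(
  parts: ToolPartSketch[],
  clientToolNames: Set<string>
): "park" | "recover" {
  return hasPendingClientInteractionSketch(parts, clientToolNames)
    ? "park"
    : "recover";
}
```

With an empty client-tool set, an input-available client-tool part falls through to "recover", which mirrors the hibernation-ordering hole: the predicate is only as good as the tool set available when it runs.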
-
#1672
f96a2ba Thanks @threepointone! - fix(chat-recovery): a turn making forward progress now survives unbounded deploy churn; add a work budget + shouldKeepRecovering runaway guard

Durable chat recovery used to bound a single incident with a non-resetting 15-minute wall-clock ceiling (CHAT_RECOVERY_MAX_WINDOW_MS). That ceiling was overloaded (it served as both a recovery-duration bound and a runaway-loop guard), and it terminated healthy, actively-progressing turns that simply took longer than 15 minutes of wall-clock to finish while being repeatedly interrupted by a dense deploy window, sealing them with reason="max_recovery_window_exceeded" and discarding completed work.

The two jobs are now decoupled (see design/rfc-chat-recovery-work-budget.md):

- Duration is no longer a bound for a progressing turn. The non-resetting wall-clock ceiling is removed. A turn that keeps producing content survives unbounded deploy churn. Stuck turns are still sealed by the no-progress window (5 min, resets on progress); tight no-progress alarm loops by the attempt cap.
- New runaway-loop guard, keyed to work, not time. The existing durable, monotonic, reconnect-immune progress counter is reused as a work meter. chatRecovery.maxRecoveryWork caps the produced content/tool units since an incident opened; exceeding it seals with reason="work_budget_exceeded". Defaults to Infinity: the SDK ships the mechanism but imposes no implicit cap, so it never terminates a progressing turn on its own.
- New caller predicate. chatRecovery.shouldKeepRecovering(ctx) is consulted per recovery attempt from the second onward (only when no hard bound has already sealed the incident); returning false seals with reason="recovery_aborted". This is where integrators express token/cost/step budgets the SDK should not hardcode. A throwing predicate is logged and treated as "keep recovering".
- The no-progress timeout is now configurable. chatRecovery.noProgressTimeoutMs (default 5 min, resets on progress) is the primary stuck-turn bound, now overridable per agent instead of a hardcoded constant.
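A rough sketch of how the bounds listed above might compose on each recovery attempt. The config field names come from this entry; the context fields (`attempt`, `workUnits`, `msSinceProgress`), the check order among hard bounds, and the `evaluateBudgetsSketch` function are assumptions for illustration. The attempt cap is left out because its seal reason is not stated here.

```typescript
// Hypothetical context shape; the real ChatRecoveryProgressContext may differ.
interface ProgressContextSketch {
  attempt: number; // 1-based recovery attempt
  workUnits: number; // content/tool units produced since the incident opened
  msSinceProgress: number; // time since the progress counter last advanced
}

interface RecoveryConfigSketch {
  maxRecoveryWork?: number;
  noProgressTimeoutMs?: number;
  shouldKeepRecovering?: (ctx: ProgressContextSketch) => boolean;
}

type SealReasonSketch =
  | "no_progress_timeout"
  | "work_budget_exceeded"
  | "recovery_aborted";

// Returns the reason to seal the incident, or null to keep recovering.
function evaluateBudgetsSketch(
  config: RecoveryConfigSketch,
  ctx: ProgressContextSketch
): SealReasonSketch | null {
  const noProgressTimeoutMs = config.noProgressTimeoutMs ?? 5 * 60 * 1000;
  const maxRecoveryWork = config.maxRecoveryWork ?? Infinity;

  // Hard bounds first. Note there is no wall-clock age check at all.
  if (ctx.msSinceProgress >= noProgressTimeoutMs) return "no_progress_timeout";
  if (ctx.workUnits > maxRecoveryWork) return "work_budget_exceeded";

  // Caller predicate: only from the second attempt onward, and only when no
  // hard bound has already sealed the incident.
  if (ctx.attempt >= 2 && config.shouldKeepRecovering) {
    try {
      if (config.shouldKeepRecovering(ctx) === false) return "recovery_aborted";
    } catch (err) {
      // A throwing predicate is logged and treated as "keep recovering".
      console.warn("shouldKeepRecovering threw; continuing recovery", err);
    }
  }
  return null;
}
```

With an empty config, a turn that keeps producing content is never sealed by this function no matter how many attempts or how much work it accumulates, which is the default behavior the entry describes.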
New public types from agents/chat: ChatRecoveryProgressContext. New ChatRecoveryConfig fields: maxRecoveryWork, shouldKeepRecovering, noProgressTimeoutMs. ChatRecoveryExhaustedContext.reason gains work_budget_exceeded and recovery_aborted; max_recovery_window_exceeded is retained as an open-string value but is no longer emitted.

Both @cloudflare/ai-chat and @cloudflare/think (which carries its own copy of the recovery engine) are updated identically. Defaults are unchanged except that a progressing turn is no longer terminated by wall-clock age.
-
#1668
d40cc8a Thanks @ghostwriternr! - Fix RPC resource leaks in workflows.

Workflows that use waitForApproval() or ThinkWorkflow.prompt() now release their RPC stubs promptly, preventing resource leaks and the associated "RPC stub was not disposed" warnings in your logs.
-
#1670
5d64940 Thanks @threepointone! - Fix: a deploy that interrupts an in-flight runAgentTool child no longer abandons the still-running child as interrupted.

Parent recovery re-attaches to a still-running child and tails it to its real terminal. Previously that re-attach used a flat 120s wall-clock budget that was not reset by the child's forward progress, so a healthy child whose recovery legitimately ran longer than the budget was sealed interrupted (and its already-completed work re-run from scratch), even while it was actively streaming.

The re-attach budget is now progress-keyed: it bounds how long the parent waits with no forward progress from the child (resetting on every forwarded chunk), so a genuinely hung/silent child still seals interrupted after one no-progress window and can never block recovery forever, while a healthy child that keeps streaming is followed through to terminal. The parent re-arms (opens a fresh tail) only when the child's stream closes cleanly while it is still advancing, i.e. a re-evicted-but-progressing child. A full no-progress window (the child went silent) seals no-progress immediately even if the child streamed earlier in that window; it no longer grants a bonus window. This is both the honest stall signal and what keeps at most one pending tail reader alive per re-attach (no per-cycle reader accumulation).

@cloudflare/think and @cloudflare/ai-chat additionally finalize a child facet's own agent-tool run row as soon as its recovered turn settles, regardless of whether recovery took the continue path (_chatRecoveryContinue) or the pre-stream retry path (_chatRecoveryRetry), so a re-attached parent collects the terminal result immediately instead of waiting out a full no-progress window after the child has already finished.

This release also adds:
- Typed interrupted cause. RunAgentToolResult, the agentTool() AgentToolFailure envelope, the onAgentToolFinish lifecycle result, and the agent-tool-event wire event (kind "interrupted") now carry a machine-readable reason (AgentToolInterruptedReason: "no-progress" | "window-exceeded" | "not-tailable" | "inspect-timeout" | "inspect-failed" | "recovery-deadline") and a childStillRunning boolean on interrupted results, so callers (and UIs) can branch on why a run was abandoned (and whether the child is still running) instead of pattern-matching the human-readable error prose. retryable stays coarse (always true for interrupted); refine with reason / childStillRunning. These fields are persisted (schema bump), so they survive a reconnect replay: a client that reconnects after an interrupt reconstructs the same reason / childStillRunning a live client saw, rather than undefined. The persisted cause is cleared when a soft interrupted row is later repaired to completed / error.
- Configurable re-attach budgets. Two new public AgentStaticOptions, agentToolReattachNoProgressTimeoutMs (default 120000, the progress-keyed no-progress budget) and agentToolReattachMaxWindowMs (default Infinity, i.e. no implicit wall-clock cap), let an Agent tune re-attach. The hard ceiling defaults to uncapped to mirror chat-recovery's maxRecoveryWork: Infinity. A re-attached parent follows a healthy, still-advancing child for as long as it makes progress, exactly as it would on the live (never-evicted) path, so it never abandons a long-running-but-healthy child that simply outlasts a fixed wall clock under deploy churn. A hung/silent child is bounded by the no-progress budget; a content-runaway is bounded uniformly (live and recovery) by the child's own maxRecoveryWork / shouldKeepRecovering. Integrators that want a hard wall-clock cap (and the window-exceeded child teardown it triggers) can set agentToolReattachMaxWindowMs to a finite value. Symmetrically, setting agentToolReattachNoProgressTimeoutMs to Infinity now means "never seal on no-progress" (a silent-but-alive child is followed until its stream closes or the hard ceiling fires) instead of silently skipping the wait; 0 remains the "don't wait, collect only an already-terminal child" sentinel.
- Give-up teardown (ceiling only). When the parent gives up at the hard window-exceeded ceiling, where the child has had its full recovery window and is truly exhausted, it now cancels the child (childStillRunning: false) so it stops consuming a fiber / keep-alive. no-progress give-ups stay soft (childStillRunning: true): the child is left running so a re-issue can still re-attach and repair it if it self-heals, preserving the repair-on-re-issue path. In both @cloudflare/think and @cloudflare/ai-chat, cancelAgentToolRun also aborts an in-flight chat-recovery turn (not just the original in-isolate run) and releases live tails: Think sweeps its _submissionAbortControllers, ai-chat its request AbortRegistry (abortAllRequests). As a result, a torn-down child stops grinding instead of finishing an orphaned recovered turn.
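To make the progress-keyed budget concrete, here is a minimal synchronous simulation of one re-attach. It is a sketch under stated assumptions, not the SDK's tailing code: `simulateReattachSketch`, its timeline inputs, and the extra "still-tailing" outcome are hypothetical, and the `0` sentinel is not modeled.

```typescript
type TailOutcomeSketch =
  | { kind: "terminal" }
  | { kind: "still-tailing" } // both budgets are Infinity and the child is silent
  | {
      kind: "interrupted";
      reason: "no-progress" | "window-exceeded";
      childStillRunning: boolean;
    };

interface ReattachBudgetsSketch {
  noProgressTimeoutMs: number; // progress-keyed, resets on every chunk
  maxWindowMs: number; // hard ceiling, Infinity means uncapped
}

// chunkTimesMs: ascending arrival times (ms since re-attach) of forwarded chunks.
// terminalAtMs: when the child reached its terminal state, or null if it never did.
function simulateReattachSketch(
  chunkTimesMs: number[],
  terminalAtMs: number | null,
  budgets: ReattachBudgetsSketch
): TailOutcomeSketch {
  let lastProgressMs = 0;

  // The next moment the parent would give up, and how it would give up.
  const nextGiveUp = (): { at: number; outcome: TailOutcomeSketch } => {
    const noProgressAt = lastProgressMs + budgets.noProgressTimeoutMs;
    return noProgressAt <= budgets.maxWindowMs
      ? {
          at: noProgressAt,
          // Soft give-up: the child is left running for a later re-issue.
          outcome: { kind: "interrupted", reason: "no-progress", childStillRunning: true },
        }
      : {
          at: budgets.maxWindowMs,
          // Hard ceiling: the child is cancelled.
          outcome: { kind: "interrupted", reason: "window-exceeded", childStillRunning: false },
        };
  };

  for (const chunkAt of chunkTimesMs) {
    if (terminalAtMs !== null && terminalAtMs <= chunkAt) break;
    const giveUp = nextGiveUp();
    if (giveUp.at <= chunkAt) return giveUp.outcome;
    lastProgressMs = chunkAt; // forward progress resets the no-progress clock
  }

  const giveUp = nextGiveUp();
  if (terminalAtMs !== null && terminalAtMs < giveUp.at) return { kind: "terminal" };
  if (giveUp.at === Infinity) return { kind: "still-tailing" };
  return giveUp.outcome;
}
```

A child that emits a chunk every 60 seconds for ten minutes is followed to terminal under the default budgets, whereas the old flat 120s wall-clock budget would have given up two minutes in.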
-
#1675
d915bc6 Thanks @threepointone! - The skill runner now imports just-bash and @cloudflare/codemode statically instead of dynamically, and both have moved from optional peer dependencies to regular dependencies of agents. The dynamic imports were ineffective in bundled Workers (the bundler includes them eagerly regardless) and triggered INEFFECTIVE_DYNAMIC_IMPORT warnings when bundled alongside @cloudflare/think, which imports them statically. @cloudflare/think also now statically imports its internal ExtensionManager instead of dynamically, removing the third such warning.
-
#1662
df6c0d6 Thanks @threepointone! - Add opt-in recovery for mid-turn context-window overflow.

Compaction only fires between turns (Session.compactAfter checks the threshold on appendMessage). A single long, tool-heavy turn grows the prompt step-by-step inside one streamText loop and can exceed the model's context window mid-turn, before the next pre-turn check; the provider then 400s ("prompt is too long" / context_length_exceeded) and the turn dies terminally. Think deliberately ships no provider-specific error matching, so it could neither detect nor recover from this.

This adds opt-in, provider-agnostic recovery (all default off, so no behavior change unless enabled), configured through a single contextOverflow property on Think:

- classifyChatError(error, ctx): the app maps a raw error (or the in-stream error string) to a ChatErrorClassification ("context_overflow" | "rate_limit" | "transient" | "fatal" | "unknown"). Same framework-owns-the-mechanism / app-owns-the-provider-knowledge split as tokenCounter. The classification is also threaded to onChatError / observers via ChatErrorContext.classification. The bundled, exported defaultContextOverflowClassifier covers the common providers (Anthropic, OpenAI, Google, Bedrock, …) for apps that do not need custom classification.
- contextOverflow.reactive + contextOverflow.maxRetries: when a turn fails with a context_overflow the app classified, Think discards the truncated partial, runs session.compact(), and re-runs the turn (bounded) from the compacted history instead of dying. The partial is intentionally not persisted: the retry restarts the turn from scratch, so keeping the cut-off partial would orphan a half-finished assistant message beside the recovered answer (and duplicate any tool work the retry re-issues). A no-op compaction or a spent budget surfaces the overflow terminally through onChatError with classification: "context_overflow" (never a silent end, never an infinite loop). Wired into the WebSocket, chat() / RPC, and programmatic (saveMessages / submitMessages) turn paths.
- contextOverflow.proactive: a { maxInputTokens, headroom?, maxCompactions? } pre-step guard. When the previous step's model-reported usage.inputTokens crosses maxInputTokens * (headroom ?? 0.9), Think compacts in place and feeds the recompacted history into the upcoming step, heading off the provider 400 before it happens. Keys off model-reported usage (every provider reports it), not provider error strings. Bounded per step loop by its own maxCompactions (default 1, independent of the reactive maxRetries budget).
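The classifier and the proactive threshold above could look roughly like this. Both functions are hypothetical standalone sketches: the matched strings are the two quoted in this entry plus an assumed rate-limit pattern, the real classifyChatError also receives a ctx argument whose shape is not shown here, and treating "crosses" as greater-than-or-equal is an assumption.

```typescript
type ChatErrorClassificationSketch =
  | "context_overflow"
  | "rate_limit"
  | "transient"
  | "fatal"
  | "unknown";

// App-owned provider knowledge: map a raw error (or in-stream error string)
// to a classification. The patterns here are illustrative, not exhaustive.
function classifyErrorSketch(error: unknown): ChatErrorClassificationSketch {
  const message = error instanceof Error ? error.message : String(error);
  if (/prompt is too long|context_length_exceeded/i.test(message)) {
    return "context_overflow";
  }
  if (/rate limit/i.test(message)) return "rate_limit"; // assumed pattern
  return "unknown";
}

interface ProactiveConfigSketch {
  maxInputTokens: number;
  headroom?: number; // fraction of maxInputTokens, defaults to 0.9
}

// Pre-step guard: compact when the previous step's reported input tokens
// reach maxInputTokens * (headroom ?? 0.9).
function shouldCompactBeforeStepSketch(
  previousInputTokens: number,
  config: ProactiveConfigSketch
): boolean {
  const threshold = config.maxInputTokens * (config.headroom ?? 0.9);
  return previousInputTokens >= threshold;
}
```

For a 200,000-token limit with the default headroom, the guard trips once a step reports about 180,000 input tokens, leaving room for the next step's growth before the provider would reject the request.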
Also adds a chat:context:compacted observability event (agents) emitted (once) on both proactive and reactive compaction.

Notes:

- Provider context-overflow errors always surface as in-stream error parts (confirmed against the AI SDK: streamText re-enqueues even top-level rejections as { type: "error" } fullStream parts, and toUIMessageStream passes them through without throwing), so the in-stream seam catches them on every path; the thrown-error catch path does not need separate wiring.
- Recovery effectiveness depends on the app's compaction config: a no-op compaction cannot rescue an over-budget turn (handled gracefully: terminal, not a loop). A one-time warning fires if contextOverflow.reactive is enabled but classifyChatError was never overridden.
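The "terminal, not a loop" guarantee of the reactive path can be illustrated with a bounded retry loop. This is a simplified synchronous model, not Think's code: `runWithOverflowRecoverySketch`, the injected `runTurn` / `compact` callbacks, and detecting a no-op compaction by comparing history lengths are all assumptions.

```typescript
type OverflowClassSketch =
  | "context_overflow"
  | "rate_limit"
  | "transient"
  | "fatal"
  | "unknown";

type TurnResultSketch =
  | { ok: true; text: string }
  | { ok: false; classification: OverflowClassSketch };

interface ReactiveDepsSketch {
  runTurn: (history: string[]) => TurnResultSketch;
  compact: (history: string[]) => string[];
  maxRetries: number;
}

function runWithOverflowRecoverySketch(
  history: string[],
  deps: ReactiveDepsSketch
): TurnResultSketch {
  let current = history;
  let retries = 0;
  for (;;) {
    // Each run restarts the turn from scratch; no partial output is kept.
    const result = deps.runTurn(current);
    if (result.ok || result.classification !== "context_overflow") return result;

    // Spent budget: surface the overflow terminally.
    if (retries >= deps.maxRetries) return result;

    // No-op compaction cannot rescue the turn: also terminal.
    const compacted = deps.compact(current);
    if (compacted.length >= current.length) return result;

    current = compacted;
    retries += 1;
  }
}
```

Every iteration either returns or strictly shrinks the history while consuming one retry, so the loop always terminates.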