Patch Changes
-
#1684
ab6dd95 Thanks @threepointone! - warn when `chatRecovery` is configured in `onStart()` (applied too late for wake recovery)

On every Durable Object wake the SDK evaluates chat-recovery budgets, and may seal an interrupted turn and fire `onExhausted`, before the user's `onStart()` runs (`_checkRunFibers()` is ordered ahead of `onStart()`). A `chatRecovery` config produced inside `onStart()` is therefore not yet in place at the moment recovery decides; recovery sees the built-in defaults instead, so a configured `maxRecoveryWork` / `shouldKeepRecovering` / `onExhausted` silently never applies to the recovery that matters.

This is now documented on `ChatRecoveryConfig` and the `chatRecovery` fields of `Think` / `AIChatAgent`, and the SDK logs a one-time warning if it detects `chatRecovery` being reassigned during `onStart()`. The warning fires both for a custom config object and for `chatRecovery = true` (enabling recovery and its defaults too late). Assigning `false` (disabling) in `onStart()` is intentionally not warned, since recovery already ran with the pre-`onStart()` value and disabling it afterward is a benign no-op for that wake. The fix is to assign `chatRecovery` as a class field or in the constructor.
-
#1672
f96a2ba Thanks @threepointone! - fix(chat-recovery): a turn making forward progress now survives unbounded deploy churn; add a work budget + `shouldKeepRecovering` runaway guard

Durable chat recovery used to bound a single incident with a non-resetting 15-minute wall-clock ceiling (`CHAT_RECOVERY_MAX_WINDOW_MS`). That ceiling was overloaded: it served as both a recovery-duration bound and a runaway-loop guard. As a result it terminated healthy, actively progressing turns that simply took longer than 15 minutes of wall-clock time to finish while being repeatedly interrupted by a dense deploy window, sealing them with `reason="max_recovery_window_exceeded"` and discarding completed work.

The two jobs are now decoupled (see `design/rfc-chat-recovery-work-budget.md`):

- Duration is no longer a bound for a progressing turn. The non-resetting wall-clock ceiling is removed. A turn that keeps producing content survives unbounded deploy churn. Stuck turns are still sealed by the no-progress window (5 min, resets on progress), and tight no-progress alarm loops by the attempt cap.
- New runaway-loop guard, keyed to work, not time. The existing durable, monotonic, reconnect-immune progress counter is reused as a work meter. `chatRecovery.maxRecoveryWork` caps the content/tool units produced since an incident opened; exceeding it seals with `reason="work_budget_exceeded"`. It defaults to `Infinity`: the SDK ships the mechanism but imposes no implicit cap, so it never terminates a progressing turn on its own.
- New caller predicate. `chatRecovery.shouldKeepRecovering(ctx)` is consulted per recovery attempt from the second onward (only when no hard bound has already sealed the incident); returning `false` seals with `reason="recovery_aborted"`. This is where integrators express token/cost/step budgets the SDK should not hardcode. A throwing predicate is logged and treated as "keep recovering".
- The no-progress timeout is now configurable. `chatRecovery.noProgressTimeoutMs` (default 5 min, resets on progress) is the primary stuck-turn bound, now overridable per agent instead of a hardcoded constant.
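
The bounds in the list above can be sketched as a small pure function. This is a simplified model for illustration, not the SDK's implementation: the `RecoveryState` shape and its field names, the `"stuck"` placeholder reason, and the order of the checks are all assumptions. Only `maxRecoveryWork`, `shouldKeepRecovering`, `noProgressTimeoutMs`, and the two `reason` strings come from the notes above.

```typescript
// Simplified model of the recovery bounds described above. NOT the SDK's code:
// RecoveryState, its fields, and the "stuck" placeholder are hypothetical.
interface RecoveryState {
  attempt: number; // 1-based recovery attempt for this incident
  workSinceIncident: number; // content/tool units produced since the incident opened
  msSinceLastProgress: number; // time since the progress counter last advanced
}

interface RecoveryConfigSketch {
  maxRecoveryWork: number; // changelog default: Infinity (no implicit cap)
  noProgressTimeoutMs: number; // changelog default: 5 minutes
  shouldKeepRecovering?: (state: RecoveryState) => boolean;
}

type Decision = { seal: false } | { seal: true; reason: string };

const defaults: RecoveryConfigSketch = {
  maxRecoveryWork: Infinity,
  noProgressTimeoutMs: 5 * 60 * 1000,
};

function decide(config: RecoveryConfigSketch, state: RecoveryState): Decision {
  // Stuck-turn bound: the clock it measures resets whenever the turn progresses.
  if (state.msSinceLastProgress >= config.noProgressTimeoutMs) {
    return { seal: true, reason: "stuck" }; // placeholder, not an SDK reason string
  }
  // Runaway guard keyed to work, not wall-clock time.
  if (state.workSinceIncident > config.maxRecoveryWork) {
    return { seal: true, reason: "work_budget_exceeded" };
  }
  // Caller predicate: consulted from the second attempt onward, and only
  // when no hard bound above has already sealed the incident.
  if (state.attempt >= 2 && config.shouldKeepRecovering) {
    try {
      if (!config.shouldKeepRecovering(state)) {
        return { seal: true, reason: "recovery_aborted" };
      }
    } catch (err) {
      // A throwing predicate is logged and treated as "keep recovering".
      console.warn("shouldKeepRecovering threw; continuing recovery", err);
    }
  }
  return { seal: false };
}
```

No check in this sketch looks at the wall-clock time elapsed since the incident opened, so under the defaults a turn that keeps advancing is never sealed by age alone.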
New public types from `agents/chat`: `ChatRecoveryProgressContext`. New `ChatRecoveryConfig` fields: `maxRecoveryWork`, `shouldKeepRecovering`, `noProgressTimeoutMs`. `ChatRecoveryExhaustedContext.reason` gains `work_budget_exceeded` and `recovery_aborted`; `max_recovery_window_exceeded` is retained as an open-string value but is no longer emitted.

Both `@cloudflare/ai-chat` and `@cloudflare/think` (which carries its own copy of the recovery engine) are updated identically. Defaults are unchanged except that a progressing turn is no longer terminated by wall-clock age.
-
#1668
d40cc8a Thanks @ghostwriternr! - Fix RPC resource leaks in workflows.

Workflows that use `waitForApproval()` or `ThinkWorkflow.prompt()` now release their RPC stubs promptly, preventing resource leaks and the associated "RPC stub was not disposed" warnings in your logs.
-
#1679
c8d1d32 Thanks @threepointone! - fix(sub-agents): a facet sub-agent no longer touches the root DO's WebSockets, fixing a production-only "Cannot perform I/O on behalf of a different Durable Object (Native)" crash (#1677)

A sub-agent (facet) that called `setState()`, `broadcast()`, or otherwise enumerated connections, directly or indirectly via the internal `_broadcastProtocol()`, could crash in production with `Cannot perform I/O on behalf of a different Durable Object. ... (I/O type: Native)`. It reproduced when the root Agent held a live (hibernatable) WebSocket connection and the child facet was freshly bootstrapped; it never reproduced in `wrangler dev` / miniflare, which made it hard to catch.

Root cause: the `Agent` overrides of `getConnections()` and `getConnection()` fell through to `super.getConnections()` / `super.getConnection()` for facets too. On a facet, that resolves to the host/root DO's hibernatable WebSockets, and reading their attachments from the facet's I/O context is a cross-DO native I/O access that workerd aborts. `setState()` tripped it only incidentally, because `_broadcastProtocol()` enumerates connections to compute its exclude list before sending anything.

Fix: a facet's client connections are all virtual (real sockets owned by the root and bridged in), so `getConnections()` / `getConnection()` now return only the facet's virtual sub-agent connections and never fall through to the host DO's sockets. Delivery of facet state updates to clients connected directly to the sub-agent is unchanged.
-
#1670
5d64940 Thanks @threepointone! - Fix: a deploy that interrupts an in-flight `runAgentTool` child no longer abandons the still-running child as `interrupted`.

Parent recovery re-attaches to a still-running child and tails it to its real terminal. Previously that re-attach used a flat 120s wall-clock budget that was not reset by the child's forward progress, so a healthy child whose recovery legitimately ran longer than the budget was sealed `interrupted` (and its already-completed work re-run from scratch), even while it was actively streaming.

The re-attach budget is now progress-keyed: it bounds how long the parent waits with no forward progress from the child (resetting on every forwarded chunk), so a genuinely hung/silent child still seals `interrupted` after one no-progress window and can never block recovery forever, while a healthy child that keeps streaming is followed through to terminal. The parent re-arms (opens a fresh tail) only when the child's stream closes cleanly while it is still advancing, i.e. a re-evicted-but-progressing child. A full no-progress window (the child went silent) seals `no-progress` immediately even if the child streamed earlier in that window; it no longer grants a bonus window. This is both the honest stall signal and what keeps at most one pending tail reader alive per re-attach (no per-cycle reader accumulation).

`@cloudflare/think` and `@cloudflare/ai-chat` additionally finalize a child facet's own agent-tool run row as soon as its recovered turn settles, regardless of whether recovery took the continue path (`_chatRecoveryContinue`) or the pre-stream retry path (`_chatRecoveryRetry`), so a re-attached parent collects the terminal result immediately instead of waiting out a full no-progress window after the child has already finished.

This release also adds:

- Typed interrupted cause. `RunAgentToolResult`, the `agentTool()` `AgentToolFailure` envelope, the `onAgentToolFinish` lifecycle result, and the `agent-tool-event` wire event (kind `"interrupted"`) now carry a machine-readable `reason` (`AgentToolInterruptedReason`: `"no-progress" | "window-exceeded" | "not-tailable" | "inspect-timeout" | "inspect-failed" | "recovery-deadline"`) and a `childStillRunning` boolean on `interrupted` results, so callers (and UIs) can branch on why a run was abandoned (and whether the child is still running) instead of pattern-matching the human-readable `error` prose. `retryable` stays coarse (always `true` for `interrupted`); refine with `reason` / `childStillRunning`. These fields are persisted (schema bump), so they survive a reconnect replay: a client that reconnects after an interrupt reconstructs the same `reason` / `childStillRunning` a live client saw, rather than `undefined`. The persisted cause is cleared when a soft `interrupted` row is later repaired to `completed` / `error`.
- Configurable re-attach budgets. Two new public `AgentStaticOptions`, `agentToolReattachNoProgressTimeoutMs` (default 120000, the progress-keyed no-progress budget) and `agentToolReattachMaxWindowMs` (default `Infinity`, no implicit wall-clock cap), let an Agent tune re-attach. The hard ceiling defaults to uncapped to mirror chat-recovery's `maxRecoveryWork: Infinity`: a re-attached parent follows a healthy, still-advancing child for as long as it makes progress, exactly as it would on the live (never-evicted) path, so it never abandons a long-running-but-healthy child that simply outlasts a fixed wall clock under deploy churn. A hung/silent child is bounded by the no-progress budget; a content-runaway is bounded uniformly (live and recovery) by the child's own `maxRecoveryWork` / `shouldKeepRecovering`. Integrators that want a hard wall-clock cap (and the `window-exceeded` child teardown it triggers) can set `agentToolReattachMaxWindowMs` to a finite value. Symmetrically, setting `agentToolReattachNoProgressTimeoutMs` to `Infinity` now means "never seal on no-progress" (a silent-but-alive child is followed until its stream closes or the hard ceiling fires) instead of silently skipping the wait; `0` remains the "don't wait, collect only an already-terminal child" sentinel.
- Give-up teardown (ceiling only). When the parent gives up at the hard `window-exceeded` ceiling, where the child has had its full recovery window and is truly exhausted, it now cancels the child (`childStillRunning: false`) so it stops consuming a fiber / keep-alive. `no-progress` give-ups stay soft (`childStillRunning: true`): the child is left running so a re-issue can still re-attach and repair it if it self-heals, preserving the repair-on-re-issue path. In both `@cloudflare/think` and `@cloudflare/ai-chat`, `cancelAgentToolRun` also aborts an in-flight chat-recovery turn (not just the original in-isolate run) and releases live tails: Think sweeps its `_submissionAbortControllers`, ai-chat its request `AbortRegistry` (`abortAllRequests`). A torn-down child therefore stops grinding instead of finishing an orphaned recovered turn.
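
As an illustration of the list above, the sketch below shows one way a caller might branch on the typed cause, plus a minimal progress-keyed budget. Only the `AgentToolInterruptedReason` union and the `reason` / `childStillRunning` field names come from these notes; `InterruptedSketch`, `FollowUp`, `followUp`, and `NoProgressBudget` are hypothetical names, and the follow-up policy is an app-level choice rather than SDK behavior.

```typescript
// Reason union copied from the notes above; every other name is a hypothetical stand-in.
type AgentToolInterruptedReason =
  | "no-progress"
  | "window-exceeded"
  | "not-tailable"
  | "inspect-timeout"
  | "inspect-failed"
  | "recovery-deadline";

interface InterruptedSketch {
  reason?: AgentToolInterruptedReason;
  childStillRunning?: boolean;
}

type FollowUp = "reissue-to-reattach" | "start-over" | "report-failure";

// One possible app-level policy (an assumption, not SDK behavior).
function followUp(result: InterruptedSketch): FollowUp {
  // Soft give-up: the child is still alive, so a re-issue can re-attach to it.
  if (result.childStillRunning) return "reissue-to-reattach";
  // Hard ceiling: the parent cancelled the child, so nothing is left to re-attach to.
  if (result.reason === "window-exceeded") return "start-over";
  return "report-failure";
}

// Progress-keyed budget: the deadline moves forward on every forwarded chunk.
class NoProgressBudget {
  private readonly timeoutMs: number;
  private lastProgressAt: number;

  constructor(timeoutMs: number, startedAt: number) {
    this.timeoutMs = timeoutMs;
    this.lastProgressAt = startedAt;
  }

  // Call whenever the child forwards a chunk.
  onChunk(now: number): void {
    this.lastProgressAt = now;
  }

  // Infinity never expires; 0 expires immediately ("don't wait").
  isExhausted(now: number): boolean {
    return now - this.lastProgressAt >= this.timeoutMs;
  }
}
```

With a single comparison against the last-progress timestamp, the `Infinity` and `0` sentinels fall out naturally, and a child that keeps streaming never exhausts the budget regardless of total elapsed time.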
-
#1680
8f9500a Thanks @threepointone! - Remove the now-redundant `_suppressProtocolBroadcasts` facet-bootstrap guard.

This flag was added in #1425 to stop `_broadcastProtocol()` from enumerating the parent DO's WebSockets during facet bootstrap (the cross-DO Native I/O crash, #1410/#1677). The proper fix in #1679 makes `getConnections()` / `broadcast()` facet-safe at the source: on a facet they return only virtual sub-agent connections and route through the parent bridge, never touching the parent's own sockets. With that, suppressing broadcasts during bootstrap is unnecessary, and removing it also lets legitimate state sync run during the bootstrap window.

The separate request/WebSocket/email native-handle clearing from #1425 is retained, since #1679 does not cover that vector.
-
#1675
d915bc6 Thanks @threepointone! - The skill runner now imports `just-bash` and `@cloudflare/codemode` statically instead of dynamically, and both have moved from optional peer dependencies to regular dependencies of `agents`. The dynamic imports were ineffective in bundled Workers (the bundler includes them eagerly regardless) and triggered `INEFFECTIVE_DYNAMIC_IMPORT` warnings when bundled alongside `@cloudflare/think`, which imports them statically. `@cloudflare/think` also now statically imports its internal `ExtensionManager` instead of dynamically, removing the third such warning.
-
#1662
df6c0d6 Thanks @threepointone! - Add opt-in recovery for mid-turn context-window overflow.

Compaction only fires between turns (`Session.compactAfter` checks the threshold on `appendMessage`). A single long, tool-heavy turn grows the prompt step-by-step inside one `streamText` loop and can exceed the model's context window mid-turn, before the next pre-turn check. The provider then returns a 400 (`"prompt is too long"` / `context_length_exceeded`) and the turn dies terminally. Think deliberately ships no provider-specific error matching, so it could neither detect nor recover from this.

This adds opt-in, provider-agnostic recovery (all default off, so no behavior change unless enabled), configured through a single `contextOverflow` property on `Think`:

- `classifyChatError(error, ctx)`: the app maps a raw error (or the in-stream error string) to a `ChatErrorClassification` (`"context_overflow" | "rate_limit" | "transient" | "fatal" | "unknown"`). Same framework-owns-the-mechanism / app-owns-the-provider-knowledge split as `tokenCounter`. The classification is also threaded to `onChatError` / observers via `ChatErrorContext.classification`. The bundled, exported `defaultContextOverflowClassifier` covers the common providers (Anthropic, OpenAI, Google, Bedrock, …) for apps that do not need custom classification.
- `contextOverflow.reactive` + `contextOverflow.maxRetries`: when a turn fails with a `context_overflow` the app classified, Think discards the truncated partial, runs `session.compact()`, and re-runs the turn (bounded) from the compacted history instead of dying. The partial is intentionally not persisted: the retry restarts the turn from scratch, so keeping the cut-off partial would orphan a half-finished assistant message beside the recovered answer (and duplicate any tool work the retry re-issues). A no-op compaction or a spent budget surfaces the overflow terminally through `onChatError` with `classification: "context_overflow"`: never a silent end, never an infinite loop. Wired into the WebSocket, `chat()` / RPC, and programmatic (`saveMessages` / `submitMessages`) turn paths.
- `contextOverflow.proactive`: a `{ maxInputTokens, headroom?, maxCompactions? }` pre-step guard. When the previous step's model-reported `usage.inputTokens` crosses `maxInputTokens * (headroom ?? 0.9)`, Think compacts in place and feeds the recompacted history into the upcoming step, heading off the provider 400 before it happens. It keys off model-reported usage (every provider reports it), not provider error strings. Bounded per step loop by its own `maxCompactions` (default 1, independent of the reactive `maxRetries` budget).
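
To make the classification and the proactive threshold concrete, here is a minimal sketch. The `ChatErrorClassification` union, the two provider markers, and the `maxInputTokens * (headroom ?? 0.9)` formula come from the notes above. The function names, the omission of the real `ctx` argument, the matching rule, and the use of `>=` for "crosses" are assumptions for illustration and do not reflect how `Think` or `defaultContextOverflowClassifier` is implemented.

```typescript
// The union below is copied from the notes above. The matching rule and the
// function shapes are hypothetical; the real `ctx` argument is omitted.
type ChatErrorClassification =
  | "context_overflow"
  | "rate_limit"
  | "transient"
  | "fatal"
  | "unknown";

function classifyChatErrorSketch(error: unknown): ChatErrorClassification {
  const message = error instanceof Error ? error.message : String(error);
  // Both markers are the provider messages quoted in this entry.
  if (/prompt is too long|context_length_exceeded/i.test(message)) {
    return "context_overflow";
  }
  return "unknown";
}

interface ProactiveGuard {
  maxInputTokens: number;
  headroom?: number; // fraction of maxInputTokens; 0.9 when omitted
  maxCompactions?: number; // per step loop; 1 when omitted
}

function shouldCompactBeforeNextStep(
  guard: ProactiveGuard,
  lastStepInputTokens: number,
  compactionsSoFar: number
): boolean {
  const threshold = guard.maxInputTokens * (guard.headroom ?? 0.9);
  const budgetLeft = compactionsSoFar < (guard.maxCompactions ?? 1);
  // ">=" is an assumption; the notes only say usage "crosses" the threshold.
  return budgetLeft && lastStepInputTokens >= threshold;
}
```

For example, with `maxInputTokens: 200000` and the default headroom the threshold is roughly 180,000 tokens, so a step that reported 190,000 input tokens would trigger one compaction, while a step that reported 170,000 would not.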
Also adds a `chat:context:compacted` observability event (`agents`) emitted (once) on both proactive and reactive compaction.

Notes:

- Provider context-overflow errors always surface as in-stream error parts (confirmed against the AI SDK: `streamText` re-enqueues even top-level rejections as `{ type: "error" }` `fullStream` parts, and `toUIMessageStream` passes them through without throwing), so the in-stream seam catches them on every path; the thrown-error catch path does not need separate wiring.
- Recovery effectiveness depends on the app's compaction config: a no-op compaction cannot rescue an over-budget turn (handled gracefully: terminal, not a loop). A one-time warning fires if `contextOverflow.reactive` is enabled but `classifyChatError` was never overridden.
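
The "terminal, not a loop" behavior described in these notes can be sketched as a bounded retry loop. This is a synchronous, simplified model for illustration only: real turns are asynchronous, and `runWithReactiveRecovery`, `TurnOutcome`, and the boolean returned by `compact` (false meaning the compaction was a no-op) are hypothetical names, not part of the `Think` API.

```typescript
// Hypothetical, synchronous model of bounded reactive recovery.
type Classification =
  | "context_overflow"
  | "rate_limit"
  | "transient"
  | "fatal"
  | "unknown";

type TurnOutcome =
  | { ok: true; text: string }
  | { ok: false; classification: Classification };

function runWithReactiveRecovery(
  runTurn: () => TurnOutcome,
  compact: () => boolean, // returns false when compaction changed nothing
  maxRetries: number
): TurnOutcome {
  let outcome = runTurn();
  let retries = 0;
  while (
    !outcome.ok &&
    outcome.classification === "context_overflow" &&
    retries < maxRetries
  ) {
    // A no-op compaction cannot rescue the turn: stop and surface the failure.
    if (!compact()) break;
    retries += 1;
    // The cut-off partial is discarded; the turn restarts from compacted history.
    outcome = runTurn();
  }
  // Either a success, or a terminal failure the caller reports (never a loop).
  return outcome;
}
```

The loop exits on success, on a non-overflow failure, on a no-op compaction, or once `maxRetries` is spent, so the turn runs at most `maxRetries + 1` times.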
-
#1675
d915bc6 Thanks @threepointone! - The `agents/vite` plugin now stubs `turndown` by default. `turndown` (pulled in transitively by `just-bash` for the workspace bash tool and skill runner) runs a top-level `require()` in its Node DOM fallback, which throws `ReferenceError: require is not defined` at Worker startup, even when the bash tool is never used. The plugin replaces it with an inert stub so Workers deploys stay clean. Opt out with `agents({ stubTurndown: false })` if your app uses `turndown` directly.