cloudflare/agents @cloudflare/think@0.8.3 on GitHub

Patch Changes

#1684 ab6dd95 Thanks @threepointone! - warn when chatRecovery is configured in onStart() (applied too late for wake recovery)
On every Durable Object wake, the SDK evaluates chat-recovery budgets (and may seal an interrupted turn, firing onExhausted) before the user's onStart() runs, because _checkRunFibers() is ordered ahead of onStart(). A chatRecovery config assigned inside onStart() is therefore not yet in place at the moment recovery decides, and recovery falls back to the built-in defaults, so a configured maxRecoveryWork / shouldKeepRecovering / onExhausted silently never applies to the recovery that matters.
This is now documented on ChatRecoveryConfig and the chatRecovery fields of Think / AIChatAgent, and the SDK logs a one-time warning if it detects chatRecovery being reassigned during onStart(). The warning fires both for a custom config object and for chatRecovery = true (enabling recovery / its defaults too late); assigning false (disabling) in onStart() is intentionally not warned, since recovery already ran with the pre-onStart() value and disabling it afterward is a benign no-op for that wake. The fix is to assign chatRecovery as a class field or in the constructor.
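The ordering problem can be illustrated with a small stand-in. This is only a sketch: FakeAgent, its wake() method, and the RecoveryConfig shape are hypothetical and are not the SDK's classes. The only behavior it mirrors is the one described above, namely that recovery reads chatRecovery before onStart() runs.

```typescript
// Hypothetical stand-in for the SDK base class; NOT the real Think / AIChatAgent.
type RecoveryConfig = { maxRecoveryWork: number };
const DEFAULTS: RecoveryConfig = { maxRecoveryWork: Infinity };

class FakeAgent {
  // `true` means "recovery enabled with built-in defaults" in this sketch.
  chatRecovery: RecoveryConfig | true = true;

  // Mirrors the described ordering: recovery is evaluated BEFORE onStart().
  // Returns the config that recovery actually saw on this wake.
  wake(): RecoveryConfig {
    const seenByRecovery =
      this.chatRecovery === true ? DEFAULTS : this.chatRecovery;
    this.onStart();
    return seenByRecovery;
  }

  onStart(): void {}
}

// Too late: the assignment happens after recovery has already decided.
class TooLate extends FakeAgent {
  onStart(): void {
    this.chatRecovery = { maxRecoveryWork: 500 };
  }
}

// In time: the constructor runs before any wake-time recovery.
class OnTime extends FakeAgent {
  constructor() {
    super();
    this.chatRecovery = { maxRecoveryWork: 500 };
  }
}
```

In this sketch, new TooLate().wake() reports the default (uncapped) budget while new OnTime().wake() reports the configured cap of 500, which is the difference the new warning is meant to surface.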
#1684 ab6dd95 Thanks @threepointone! - fix(chat-recovery): don't seal a human-in-the-loop turn that is waiting on a pending client tool call
A turn parked on a pending CLIENT interaction — an input-available client-tool part (no server execute) or an approval-requested part, as detected by hasPendingInteraction() — is waiting on the human, not stuck. After a mid-turn Durable Object restart (e.g. a deploy), the in-memory pending-interaction promise is gone, so waitUntilStable() repeatedly times out until the client reconnects and replays the tool-result/approval. That replay drives a fresh continuation via the auto-continuation barrier independently of recovery — but the recovery loop was treating those timeouts as deploy churn:
- each stable-state timeout burned a recovery attempt, eventually sealing a perfectly healthy turn with reason="stable_timeout", and
- the no-progress window (which never advances while no content is produced) could seal it with reason="no_progress_timeout" once it elapsed.
The net effect: an interrupted human-in-the-loop turn whose user simply took longer than the configured noProgressTimeoutMs / attempt budget to answer a tool prompt was terminalized with a "session interrupted" banner, even though nothing had actually failed.
While a client interaction is pending the turn is now budget-free:
- _beginChatRecoveryIncident suppresses the no-progress window, attempt cap, work budget, and shouldKeepRecovering predicate, and keeps the no-progress clock fresh so the turn gets a full window once the human finally answers.
- _chatRecoveryContinue / _chatRecoveryRetry park (mark the incident skipped with reason="awaiting_client_interaction", resolving the live "recovering…" indicator) instead of rescheduling or exhausting — the client's eventual replay resumes the turn. A client that never returns is reclaimed by the incident TTL sweep and DO idle-eviction.
In @cloudflare/think, a submitMessages-backed turn additionally has its durable submission row completed at park time. The recovery loop is that row's sole completion driver after a restart, and the client's replay resumes the conversation as an independent auto-continuation that never touches the submission — so parking without completing would leave the row running, and the next restart's _recoverSubmissionsOnStart would sweep it to error (a false "session recovery error"). The park condition is a fully-materialized client tool call in the leaf, which is exactly the terminal state a non-interrupted submission reaches when its step emits a client tool call (the model does not block on client tools), so completed is the correct, consistent outcome.
SERVER-tool orphans are deliberately excluded (their execute() died with the isolate and nothing will resolve them), so they still recover normally via the transcript-repair pass.
Both @cloudflare/think and @cloudflare/ai-chat (which carries its own copy of the recovery engine) are fixed. In @cloudflare/think the client/server distinction already lived in hasPendingInteraction(). @cloudflare/ai-chat's hasPendingInteraction() (used by waitUntilStable) does not distinguish client from server tools, so a new, narrower client-only predicate hasPendingClientInteraction() was added there and gates the exemption — leaving waitUntilStable's existing behavior untouched so server-tool orphans keep reschedule/exhaust semantics.
The exemption depends on knowing the request's client tools. @cloudflare/ai-chat restores them in its constructor, so they are available when boot recovery evaluates budgets. @cloudflare/think restored them in onStart(), which the base Agent runs after the boot-recovery path (_handleInternalFiberRecovery -> _beginChatRecoveryIncident) — so on a fresh wake the in-memory cache was still empty and a client-tool input-available orphan re-detected past the no-progress window was misread as "stuck" and wrongly sealed. _beginChatRecoveryIncident now re-hydrates _lastClientTools from the durable think_config store before evaluating the budget, closing that hibernation-ordering hole (approval-requested turns were never affected, since that branch does not depend on the client tool set).
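To make the client/server distinction and the hibernation-ordering hole concrete, here is a rough sketch of a client-only pending-interaction check. The ToolPartSketch shape and both function names are hypothetical simplifications; the real hasPendingInteraction() / hasPendingClientInteraction() operate on richer message-part types.

```typescript
// Simplified, hypothetical part shape for illustration only.
type ToolPartSketch = {
  toolName: string;
  state: "input-available" | "approval-requested" | "output-available";
};

// A turn is "waiting on the human" when the leaf holds an approval request,
// or an input-available call to a tool the CLIENT is known to own.
function pendingClientInteractionSketch(
  leafParts: ToolPartSketch[],
  clientToolNames: Set<string>
): boolean {
  return leafParts.some(
    (part) =>
      part.state === "approval-requested" ||
      (part.state === "input-available" && clientToolNames.has(part.toolName))
  );
}

// Park (budget-free) while the human is pending; otherwise run normal budgets.
function recoveryDecisionSketch(
  leafParts: ToolPartSketch[],
  clientToolNames: Set<string>
): "park" | "evaluate-budgets" {
  return pendingClientInteractionSketch(leafParts, clientToolNames)
    ? "park"
    : "evaluate-budgets";
}
```

With an empty clientToolNames set (the state of the in-memory cache on a fresh wake before this fix), an input-available client tool call falls through to "evaluate-budgets", whereas an approval-requested part parks regardless. That asymmetry is why only the input-available branch needed the re-hydration from the durable store.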
#1672 f96a2ba Thanks @threepointone! - fix(chat-recovery): a turn making forward progress now survives unbounded deploy churn; add a work budget + shouldKeepRecovering runaway guard
Durable chat recovery used to bound a single incident with a non-resetting 15-minute wall-clock ceiling (CHAT_RECOVERY_MAX_WINDOW_MS). That ceiling was overloaded — it served as both a recovery-duration bound and a runaway-loop guard — and it terminated healthy, actively-progressing turns that simply took longer than 15 minutes of wall-clock to finish while being repeatedly interrupted by a dense deploy window, sealing them with reason="max_recovery_window_exceeded" and discarding completed work.
The two jobs are now decoupled (see design/rfc-chat-recovery-work-budget.md):
- Duration is no longer a bound for a progressing turn. The non-resetting wall-clock ceiling is removed. A turn that keeps producing content survives unbounded deploy churn. Stuck turns are still sealed by the no-progress window (5 min, resets on progress), and tight no-progress alarm loops are still sealed by the attempt cap.
- New runaway-loop guard, keyed to work, not time. The existing durable, monotonic, reconnect-immune progress counter is reused as a work meter. chatRecovery.maxRecoveryWork caps the produced content/tool units since an incident opened; exceeding it seals with reason="work_budget_exceeded". Defaults to Infinity — the SDK ships the mechanism but imposes no implicit cap, so it never terminates a progressing turn on its own.
- New caller predicate. chatRecovery.shouldKeepRecovering(ctx) is consulted per recovery attempt from the second onward (only when no hard bound has already sealed the incident); returning false seals with reason="recovery_aborted". This is where integrators express token/cost/step budgets the SDK should not hardcode. A throwing predicate is logged and treated as "keep recovering".
- The no-progress timeout is now configurable. chatRecovery.noProgressTimeoutMs (default 5 min, resets on progress) is the primary stuck-turn bound, now overridable per agent instead of a hardcoded constant.
New public types from agents/chat: ChatRecoveryProgressContext. New ChatRecoveryConfig fields: maxRecoveryWork, shouldKeepRecovering, noProgressTimeoutMs. ChatRecoveryExhaustedContext.reason gains work_budget_exceeded and recovery_aborted; max_recovery_window_exceeded is retained as an open-string value but is no longer emitted.
Both @cloudflare/ai-chat and @cloudflare/think (which carries its own copy of the recovery engine) are updated identically. Defaults are unchanged except that a progressing turn is no longer terminated by wall-clock age.
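The decoupled bounds can be sketched as a single decision function. This is a minimal model, not the SDK's code: the ProgressCtxSketch field names are hypothetical (the real ChatRecoveryProgressContext shape is not given here), the order of checks is an assumption, and the attempt cap is omitted because its option name is not stated above.

```typescript
// Hypothetical context shape; the real ChatRecoveryProgressContext may differ.
type ProgressCtxSketch = {
  attempt: number; // 1-based recovery attempt within the incident
  workSinceOpen: number; // content/tool units produced since the incident opened
  msSinceProgress: number; // time since the progress counter last advanced
};

type RecoveryConfigSketch = {
  noProgressTimeoutMs?: number;
  maxRecoveryWork?: number;
  shouldKeepRecovering?: (ctx: ProgressCtxSketch) => boolean;
};

type Verdict =
  | "keep"
  | "no_progress_timeout"
  | "work_budget_exceeded"
  | "recovery_aborted";

function evaluateRecoverySketch(
  ctx: ProgressCtxSketch,
  cfg: RecoveryConfigSketch
): Verdict {
  const noProgressMs = cfg.noProgressTimeoutMs ?? 5 * 60 * 1000; // default 5 min
  const maxWork = cfg.maxRecoveryWork ?? Infinity; // default: no implicit cap

  // Hard bounds first (assumed order).
  if (ctx.msSinceProgress >= noProgressMs) return "no_progress_timeout";
  if (ctx.workSinceOpen > maxWork) return "work_budget_exceeded";

  // Caller predicate: consulted from the second attempt onward,
  // and only when no hard bound has already sealed the incident.
  if (ctx.attempt >= 2 && cfg.shouldKeepRecovering) {
    try {
      if (cfg.shouldKeepRecovering(ctx) === false) return "recovery_aborted";
    } catch (err) {
      // A throwing predicate is logged and treated as "keep recovering".
      console.warn("shouldKeepRecovering threw; continuing recovery", err);
    }
  }
  return "keep";
}
```

Note that elapsed wall-clock time since the incident opened does not appear anywhere in the function: a turn that keeps advancing its progress counter is only ever stopped by the work budget or the caller's predicate.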
#1668 d40cc8a Thanks @ghostwriternr! - Fix RPC resource leaks in workflows.
Workflows that use waitForApproval() or ThinkWorkflow.prompt() now release their RPC stubs promptly, preventing resource leaks and the associated "RPC stub was not disposed" warnings in your logs.
#1670 5d64940 Thanks @threepointone! - Fix: a deploy that interrupts an in-flight runAgentTool child no longer abandons the still-running child as interrupted.
Parent recovery re-attaches to a still-running child and tails it to its real terminal. Previously that re-attach used a flat 120s wall-clock budget that was not reset by the child's forward progress, so a healthy child whose recovery legitimately ran longer than the budget was sealed interrupted (and its already-completed work re-run from scratch), even while it was actively streaming.
The re-attach budget is now progress-keyed: it bounds how long the parent waits with no forward progress from the child (resetting on every forwarded chunk), so a genuinely hung/silent child still seals interrupted after one no-progress window and can never block recovery forever, while a healthy child that keeps streaming is followed through to terminal. The parent re-arms (opens a fresh tail) only when the child's stream closes cleanly while it is still advancing — i.e. a re-evicted-but-progressing child. A full no-progress window (the child went silent) seals no-progress immediately even if the child streamed earlier in that window; it no longer grants a bonus window. This is both the honest stall signal and what keeps at most one pending tail reader alive per re-attach (no per-cycle reader accumulation).
@cloudflare/think and @cloudflare/ai-chat additionally finalize a child facet's own agent-tool run row as soon as its recovered turn settles — regardless of whether recovery took the continue path (_chatRecoveryContinue) or the pre-stream retry path (_chatRecoveryRetry) — so a re-attached parent collects the terminal result immediately instead of waiting out a full no-progress window after the child has already finished.
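The difference between a flat wall-clock budget and the progress-keyed budget can be modeled with a simulated timeline. The sketch below is an illustration under stated assumptions: TailEvent, simulateReattach, and the "still-waiting" outcome are hypothetical names, events are assumed to be sorted by time, and re-attach is assumed to start at t = 0.

```typescript
type TailEvent = { atMs: number; kind: "chunk" | "terminal" };
type TailOutcome = "terminal" | "no-progress" | "window-exceeded" | "still-waiting";

// Whichever deadline comes first decides how the parent would give up.
function nextDeadline(
  lastProgressMs: number,
  noProgressMs: number,
  maxWindowMs: number
): { atMs: number; outcome: TailOutcome } {
  const silentAt = lastProgressMs + noProgressMs;
  return silentAt <= maxWindowMs
    ? { atMs: silentAt, outcome: "no-progress" }
    : { atMs: maxWindowMs, outcome: "window-exceeded" };
}

function simulateReattach(
  events: TailEvent[],
  noProgressMs: number = 120000,
  maxWindowMs: number = Infinity
): TailOutcome {
  let lastProgressMs = 0;
  for (const ev of events) {
    const deadline = nextDeadline(lastProgressMs, noProgressMs, maxWindowMs);
    if (ev.atMs >= deadline.atMs) return deadline.outcome; // gave up before this event
    if (ev.kind === "terminal") return "terminal";
    lastProgressMs = ev.atMs; // every forwarded chunk resets the no-progress clock
  }
  // The stream went silent: the pending deadline fires, unless both are unbounded.
  const deadline = nextDeadline(lastProgressMs, noProgressMs, maxWindowMs);
  return Number.isFinite(deadline.atMs) ? deadline.outcome : "still-waiting";
}
```

A child that emits a chunk every 60 seconds for ten minutes and then finishes is followed to "terminal" under the defaults, even though its total recovery time is far longer than 120 seconds. Under the old flat budget the same timeline would have been sealed interrupted.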
This release also adds:
- Typed interrupted cause. RunAgentToolResult, the agentTool() AgentToolFailure envelope, the onAgentToolFinish lifecycle result, and the agent-tool-event wire event (kind "interrupted") now carry a machine-readable reason (AgentToolInterruptedReason: "no-progress" | "window-exceeded" | "not-tailable" | "inspect-timeout" | "inspect-failed" | "recovery-deadline") and a childStillRunning boolean on interrupted results, so callers (and UIs) can branch on why a run was abandoned (and whether the child is still running) instead of pattern-matching the human-readable error prose. retryable stays coarse (always true for interrupted); refine with reason / childStillRunning. These fields are persisted (schema bump), so they survive a reconnect replay — a client that reconnects after an interrupt reconstructs the same reason / childStillRunning a live client saw, rather than undefined. The persisted cause is cleared when a soft interrupted row is later repaired to completed/error.
- Configurable re-attach budgets. Two new public AgentStaticOptions let an Agent tune re-attach: agentToolReattachNoProgressTimeoutMs (default 120000, the progress-keyed no-progress budget) and agentToolReattachMaxWindowMs (default Infinity, i.e. no implicit wall-clock cap). The hard ceiling defaults to uncapped to mirror chat-recovery's maxRecoveryWork default of Infinity. A re-attached parent follows a healthy, still-advancing child for as long as it makes progress, exactly as it would on the live (never-evicted) path, so it never abandons a long-running-but-healthy child that simply outlasts a fixed wall clock under deploy churn. A hung/silent child is bounded by the no-progress budget; a content-runaway is bounded uniformly (live and recovery) by the child's own maxRecoveryWork / shouldKeepRecovering. Integrators that want a hard wall-clock cap (and the window-exceeded child teardown it triggers) can set agentToolReattachMaxWindowMs to a finite value. Symmetrically, setting agentToolReattachNoProgressTimeoutMs to Infinity now means "never seal on no-progress" (a silent-but-alive child is followed until its stream closes or the hard ceiling fires) instead of silently skipping the wait; 0 remains the "don't wait, collect only an already-terminal child" sentinel.
- Give-up teardown (ceiling only). When the parent gives up at the hard window-exceeded ceiling — where the child has had its full recovery window and is truly exhausted — it now cancels the child (childStillRunning: false) so it stops consuming a fiber / keep-alive. no-progress give-ups stay soft (childStillRunning: true): the child is left running so a re-issue can still re-attach and repair it if it self-heals, preserving the repair-on-re-issue path. In both @cloudflare/think and @cloudflare/ai-chat, cancelAgentToolRun also aborts an in-flight chat-recovery turn (not just the original in-isolate run) and releases live tails — Think sweeps its _submissionAbortControllers, ai-chat its request AbortRegistry (abortAllRequests) — so a torn-down child stops grinding instead of finishing an orphaned recovered turn.
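As a sketch of how a caller might branch on the typed cause instead of parsing error prose, consider the following. The AgentToolInterruptedReason union is copied from the list above; the InterruptedSketch shape (including the status field) and the follow-up policy in planAfterInterrupt are assumptions for illustration, not the SDK's types or recommended behavior.

```typescript
type AgentToolInterruptedReason =
  | "no-progress"
  | "window-exceeded"
  | "not-tailable"
  | "inspect-timeout"
  | "inspect-failed"
  | "recovery-deadline";

// Hypothetical, trimmed-down view of an interrupted result.
type InterruptedSketch = {
  status: "interrupted";
  reason?: AgentToolInterruptedReason;
  childStillRunning?: boolean;
  retryable: true; // stays coarse: always true for interrupted
};

type FollowUp = "re-issue-to-reattach" | "start-fresh" | "retry-later";

// One possible policy: prefer re-attaching to a child that is still alive,
// start over when the child was torn down at the hard ceiling.
function planAfterInterrupt(result: InterruptedSketch): FollowUp {
  if (result.childStillRunning) return "re-issue-to-reattach";
  if (result.reason === "window-exceeded") return "start-fresh";
  return "retry-later";
}
```

Because reason and childStillRunning are persisted, a client that reconnects after the interrupt could apply the same policy to the replayed result as a live client would.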
#1675 d915bc6 Thanks @threepointone! - The skill runner now imports just-bash and @cloudflare/codemode statically instead of dynamically, and both have moved from optional peer dependencies to regular dependencies of agents. The dynamic imports were ineffective in bundled Workers (the bundler includes them eagerly regardless) and triggered INEFFECTIVE_DYNAMIC_IMPORT warnings when bundled alongside @cloudflare/think, which imports them statically. @cloudflare/think also now statically imports its internal ExtensionManager instead of dynamically, removing the third such warning.
#1662 df6c0d6 Thanks @threepointone! - Add opt-in recovery for mid-turn context-window overflow.
Compaction only fires between turns (Session.compactAfter checks the threshold on appendMessage). A single long, tool-heavy turn grows the prompt step by step inside one streamText loop and can exceed the model's context window mid-turn, before the next pre-turn check. The provider then returns an HTTP 400 ("prompt is too long" / context_length_exceeded) and the turn dies terminally. Think deliberately ships no provider-specific error matching, so it could neither detect nor recover from this.
This adds opt-in, provider-agnostic recovery (all default off — no behavior change unless enabled), configured through a single contextOverflow property on Think:
- classifyChatError(error, ctx) — the app maps a raw error (or the in-stream error string) to a ChatErrorClassification ("context_overflow" | "rate_limit" | "transient" | "fatal" | "unknown"). Same framework-owns-the-mechanism / app-owns-the-provider-knowledge split as tokenCounter. The classification is also threaded to onChatError/observers via ChatErrorContext.classification. The bundled, exported defaultContextOverflowClassifier covers the common providers (Anthropic, OpenAI, Google, Bedrock, …) for apps that do not need custom classification.
- contextOverflow.reactive + contextOverflow.maxRetries — when a turn fails with a context_overflow the app classified, Think discards the truncated partial, runs session.compact(), and re-runs the turn (bounded) from the compacted history instead of dying. The partial is intentionally not persisted: the retry restarts the turn from scratch, so keeping the cut-off partial would orphan a half-finished assistant message beside the recovered answer (and duplicate any tool work the retry re-issues). A no-op compaction or a spent budget surfaces the overflow terminally through onChatError with classification: "context_overflow" — never a silent end, never an infinite loop. Wired into the WebSocket, chat()/RPC, and programmatic (saveMessages/submitMessages) turn paths.
- contextOverflow.proactive — a { maxInputTokens, headroom?, maxCompactions? } pre-step guard: when the previous step's model-reported usage.inputTokens crosses maxInputTokens * (headroom ?? 0.9), Think compacts in place and feeds the recompacted history into the upcoming step, heading off the provider 400 before it happens. Keys off model-reported usage (every provider reports it), not provider error strings. Bounded per step loop by its own maxCompactions (default 1, independent of the reactive maxRetries budget).
Also adds a chat:context:compacted observability event (agents) emitted (once) on both proactive and reactive compaction.
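The proactive guard's trigger reduces to a small arithmetic check. The sketch below assumes "crosses" means greater-than-or-equal and uses a hypothetical function name; the config shape and the defaults (headroom 0.9, maxCompactions 1) are taken from the description above.

```typescript
type ProactiveConfig = {
  maxInputTokens: number;
  headroom?: number; // fraction of maxInputTokens, default 0.9
  maxCompactions?: number; // per step loop, default 1
};

// Decide, before the next step, whether to compact in place.
function shouldCompactBeforeStep(
  previousStepInputTokens: number | undefined,
  cfg: ProactiveConfig,
  compactionsSoFar: number
): boolean {
  // No usage reported yet (e.g. first step): nothing to key off.
  if (previousStepInputTokens === undefined) return false;
  // Budget for this step loop already spent.
  if (compactionsSoFar >= (cfg.maxCompactions ?? 1)) return false;
  const threshold = cfg.maxInputTokens * (cfg.headroom ?? 0.9);
  return previousStepInputTokens >= threshold; // assumption: inclusive comparison
}
```

For example, with maxInputTokens set to 200000 and the default headroom, the threshold is 180000 tokens: a previous step that reported 185000 input tokens would trigger one in-place compaction, and a second crossing in the same step loop would not trigger another unless maxCompactions were raised.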
Notes:
- Provider context-overflow errors always surface as in-stream error parts (confirmed against the AI SDK: streamText re-enqueues even top-level rejections as { type: "error" } fullStream parts, and toUIMessageStream passes them through without throwing), so the in-stream seam catches them on every path; the thrown-error catch path does not need separate wiring.
- Recovery effectiveness depends on the app's compaction config — a no-op compaction cannot rescue an over-budget turn (handled gracefully: terminal, not a loop). A one-time warning fires if contextOverflow.reactive is enabled but classifyChatError was never overridden.
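The reactive path can be sketched as a classifier plus a bounded retry loop. Everything below is a simplified, synchronous model: classifyChatErrorSketch only matches the two error strings quoted above (real apps would likely use the exported defaultContextOverflowClassifier or their own provider knowledge), the ctx parameter is left untyped because its shape is not described here, and runWithReactiveRecovery is a hypothetical stand-in for the real turn machinery.

```typescript
type ChatErrorClassification =
  | "context_overflow"
  | "rate_limit"
  | "transient"
  | "fatal"
  | "unknown";

// Maps a raw error (or in-stream error string) to a classification.
function classifyChatErrorSketch(
  error: unknown,
  _ctx?: unknown
): ChatErrorClassification {
  const text =
    typeof error === "string"
      ? error
      : error instanceof Error
        ? error.message
        : String(error);
  const lower = text.toLowerCase();
  if (
    lower.includes("prompt is too long") ||
    lower.includes("context_length_exceeded")
  ) {
    return "context_overflow";
  }
  return "unknown";
}

// Bounded compact-and-retry: never a silent end, never an infinite loop.
function runWithReactiveRecovery(
  runTurn: () => "ok" | ChatErrorClassification,
  compact: () => boolean, // returns false when compaction was a no-op
  maxRetries: number
): "ok" | "terminal" {
  let retries = 0;
  while (true) {
    const outcome = runTurn();
    if (outcome === "ok") return "ok";
    if (outcome !== "context_overflow") return "terminal"; // not recoverable here
    if (retries >= maxRetries) return "terminal"; // retry budget spent
    if (!compact()) return "terminal"; // no-op compaction cannot help
    retries++; // partial is discarded; the turn restarts from compacted history
  }
}
```

The two terminal exits for overflow (spent budget and no-op compaction) correspond to the cases that surface through onChatError with classification "context_overflow".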