github cloudflare/agents @cloudflare/ai-chat@0.8.2

6 hours ago

Patch Changes

  • #1684 ab6dd95 Thanks @threepointone! - warn when chatRecovery is configured in onStart() (applied too late for wake recovery)

    On every Durable Object wake the SDK evaluates chat-recovery budgets, and may seal an interrupted turn (firing onExhausted), before the user's onStart() runs (_checkRunFibers() is ordered ahead of onStart()). A chatRecovery config produced inside onStart() is therefore read as the built-in defaults at the moment recovery decides, so a configured maxRecoveryWork / shouldKeepRecovering / onExhausted silently never applies to the recovery that matters.

    This is now documented on ChatRecoveryConfig and the chatRecovery fields of Think / AIChatAgent, and the SDK logs a one-time warning if it detects chatRecovery being reassigned during onStart(). The warning fires both for a custom config object and for chatRecovery = true (enabling recovery / its defaults too late); assigning false (disabling) in onStart() is intentionally not warned, since recovery already ran with the pre-onStart() value and disabling it afterward is a benign no-op for that wake. The fix is to assign chatRecovery as a class field or in the constructor.
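
    The ordering problem can be illustrated with a minimal, self-contained sketch. SketchAgent below is a hypothetical stand-in, not the SDK's Agent class; it models only the documented wake order (recovery reads the config, then onStart() runs), and the config shape and the value 500 are assumptions for illustration.

```typescript
// Hypothetical stand-in for the SDK base class. It models only the wake
// ordering described above: recovery reads the config, then onStart() runs.
type RecoveryConfigSketch = { maxRecoveryWork: number };
const BUILT_IN_DEFAULTS: RecoveryConfigSketch = { maxRecoveryWork: Infinity };

class SketchAgent {
  chatRecovery: RecoveryConfigSketch | boolean = true;
  // What the recovery check saw when it made its decision.
  configSeenByRecovery: RecoveryConfigSketch = BUILT_IN_DEFAULTS;

  wake(): void {
    // Stand-in for _checkRunFibers(): budgets are evaluated first...
    this.configSeenByRecovery =
      typeof this.chatRecovery === "object" ? this.chatRecovery : BUILT_IN_DEFAULTS;
    // ...and only then does user code in onStart() run.
    this.onStart();
  }

  onStart(): void {}
}

// Too late: the config is assigned after recovery has already decided.
class LateConfigAgent extends SketchAgent {
  override onStart(): void {
    this.chatRecovery = { maxRecoveryWork: 500 };
  }
}

// In time: the config is assigned during construction (a class field
// initializer runs at the same point in the lifecycle).
class EarlyConfigAgent extends SketchAgent {
  constructor() {
    super();
    this.chatRecovery = { maxRecoveryWork: 500 };
  }
}

const lateAgent = new LateConfigAgent();
lateAgent.wake();
const earlyAgent = new EarlyConfigAgent();
earlyAgent.wake();

console.log(lateAgent.configSeenByRecovery.maxRecoveryWork); // Infinity
console.log(earlyAgent.configSeenByRecovery.maxRecoveryWork); // 500
```

    In the sketch, the late assignment still takes effect on the instance, but only after the recovery decision for that wake has been made with the defaults.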

  • #1684 ab6dd95 Thanks @threepointone! - fix(chat-recovery): don't seal a human-in-the-loop turn that is waiting on a pending client tool call

    A turn parked on a pending CLIENT interaction — an input-available client-tool part (no server execute) or an approval-requested part, as detected by hasPendingInteraction() — is waiting on the human, not stuck. After a mid-turn Durable Object restart (e.g. a deploy), the in-memory pending-interaction promise is gone, so waitUntilStable() repeatedly times out until the client reconnects and replays the tool-result/approval. That replay drives a fresh continuation via the auto-continuation barrier independently of recovery — but the recovery loop was treating those timeouts as deploy churn:

    • each stable-state timeout burned a recovery attempt, eventually sealing a perfectly healthy turn with reason="stable_timeout", and
    • the no-progress window (which never advances while no content is produced) could seal it with reason="no_progress_timeout" once it elapsed.

    The net effect: an interrupted human-in-the-loop turn whose user simply took longer than the configured noProgressTimeoutMs / attempt budget to answer a tool prompt was terminalized with a "session interrupted" banner, even though nothing had actually failed.

    While a client interaction is pending the turn is now budget-free:

    • _beginChatRecoveryIncident suppresses the no-progress window, attempt cap, work budget, and shouldKeepRecovering predicate, and keeps the no-progress clock fresh so the turn gets a full window once the human finally answers.
    • _chatRecoveryContinue / _chatRecoveryRetry park (mark the incident skipped with reason="awaiting_client_interaction", resolving the live "recovering…" indicator) instead of rescheduling or exhausting — the client's eventual replay resumes the turn. A client that never returns is reclaimed by the incident TTL sweep and DO idle-eviction.

    In @cloudflare/think, a submitMessages-backed turn additionally has its durable submission row completed at park time. The recovery loop is that row's sole completion driver after a restart, and the client's replay resumes the conversation as an independent auto-continuation that never touches the submission — so parking without completing would leave the row running, and the next restart's _recoverSubmissionsOnStart would sweep it to error (a false "session recovery error"). The park condition is a fully-materialized client tool call in the leaf, which is exactly the terminal state a non-interrupted submission reaches when its step emits a client tool call (the model does not block on client tools), so completed is the correct, consistent outcome.

    SERVER-tool orphans are deliberately excluded (their execute() died with the isolate and nothing will resolve them), so they still recover normally via the transcript-repair pass.

    Both @cloudflare/think and @cloudflare/ai-chat (which carries its own copy of the recovery engine) are fixed. In @cloudflare/think the client/server distinction already lived in hasPendingInteraction(). @cloudflare/ai-chat's hasPendingInteraction() (used by waitUntilStable) does not distinguish client from server tools, so a new, narrower client-only predicate hasPendingClientInteraction() was added there and gates the exemption — leaving waitUntilStable's existing behavior untouched so server-tool orphans keep reschedule/exhaust semantics.
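
    A rough sketch of what a client-only predicate could look like follows. The ToolPartSketch shape, the function name, and the serverToolNames parameter are hypothetical simplifications; the actual hasPendingClientInteraction() in @cloudflare/ai-chat may be structured differently.

```typescript
// Hypothetical, simplified tool-part shape; the real message part types are richer.
type ToolPartSketch = {
  toolName: string;
  state: "input-streaming" | "input-available" | "approval-requested" | "output-available";
};

// Assumption: a tool is a client tool when it has no server-side execute(),
// modeled here as "its name is not in the set of server tool names".
function hasPendingClientInteractionSketch(
  parts: ToolPartSketch[],
  serverToolNames: Set<string>
): boolean {
  return parts.some(
    (part) =>
      // Approval requests always wait on the human.
      part.state === "approval-requested" ||
      // A fully materialized call to a client tool waits on the client's result.
      (part.state === "input-available" && !serverToolNames.has(part.toolName))
  );
}

const serverTools = new Set(["searchDocs"]);

// Client tool parked on the human: exempt from recovery budgets.
const waitingOnHuman = hasPendingClientInteractionSketch(
  [{ toolName: "confirmPurchase", state: "input-available" }],
  serverTools
);

// Server-tool orphan: not exempt, so it keeps reschedule/exhaust semantics.
const serverOrphan = hasPendingClientInteractionSketch(
  [{ toolName: "searchDocs", state: "input-available" }],
  serverTools
);

console.log(waitingOnHuman, serverOrphan); // true false
```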

    The exemption depends on knowing the request's client tools. @cloudflare/ai-chat restores them in its constructor, so they are available when boot recovery evaluates budgets. @cloudflare/think restored them in onStart(), which the base Agent runs after the boot-recovery path (_handleInternalFiberRecovery -> _beginChatRecoveryIncident) — so on a fresh wake the in-memory cache was still empty and a client-tool input-available orphan re-detected past the no-progress window was misread as "stuck" and wrongly sealed. _beginChatRecoveryIncident now re-hydrates _lastClientTools from the durable think_config store before evaluating the budget, closing that hibernation-ordering hole (approval-requested turns were never affected, since that branch does not depend on the client tool set).

  • #1672 f96a2ba Thanks @threepointone! - fix(chat-recovery): a turn making forward progress now survives unbounded deploy churn; add a work budget + shouldKeepRecovering runaway guard

    Durable chat recovery used to bound a single incident with a non-resetting 15-minute wall-clock ceiling (CHAT_RECOVERY_MAX_WINDOW_MS). That ceiling was overloaded — it served as both a recovery-duration bound and a runaway-loop guard — and it terminated healthy, actively-progressing turns that simply took longer than 15 minutes of wall-clock to finish while being repeatedly interrupted by a dense deploy window, sealing them with reason="max_recovery_window_exceeded" and discarding completed work.

    The two jobs are now decoupled (see design/rfc-chat-recovery-work-budget.md):

    • Duration is no longer a bound for a progressing turn. The non-resetting wall-clock ceiling is removed. A turn that keeps producing content survives unbounded deploy churn. Stuck turns are still sealed by the no-progress window (5 min, resets on progress); tight no-progress alarm loops by the attempt cap.
    • New runaway-loop guard, keyed to work, not time. The existing durable, monotonic, reconnect-immune progress counter is reused as a work meter. chatRecovery.maxRecoveryWork caps the produced content/tool units since an incident opened; exceeding it seals with reason="work_budget_exceeded". Defaults to Infinity — the SDK ships the mechanism but imposes no implicit cap, so it never terminates a progressing turn on its own.
    • New caller predicate. chatRecovery.shouldKeepRecovering(ctx) is consulted per recovery attempt from the second onward (only when no hard bound has already sealed the incident); returning false seals with reason="recovery_aborted". This is where integrators express token/cost/step budgets the SDK should not hardcode. A throwing predicate is logged and treated as "keep recovering".
    • The no-progress timeout is now configurable. chatRecovery.noProgressTimeoutMs (default 5 min, resets on progress) is the primary stuck-turn bound, now overridable per agent instead of a hardcoded constant.
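
    The way these bounds interact can be sketched as a single decision function. This is an illustration only: the context field names, the order in which the hard bounds are checked, and the exact comparison operators are assumptions, and the attempt cap is omitted for brevity.

```typescript
// Hypothetical context shape; the real ChatRecoveryProgressContext may differ.
type ProgressCtxSketch = {
  attempt: number; // 1-based recovery attempt number
  workSinceIncident: number; // content/tool units produced since the incident opened
  msSinceProgress: number; // time since the progress counter last advanced
};

type BudgetConfigSketch = {
  noProgressTimeoutMs: number;
  maxRecoveryWork: number;
  shouldKeepRecovering?: (ctx: ProgressCtxSketch) => boolean;
};

type RecoveryDecision = { seal: false } | { seal: true; reason: string };

function decideRecovery(ctx: ProgressCtxSketch, cfg: BudgetConfigSketch): RecoveryDecision {
  // Hard bound 1: the turn is stuck (no progress within the window).
  if (ctx.msSinceProgress >= cfg.noProgressTimeoutMs) {
    return { seal: true, reason: "no_progress_timeout" };
  }
  // Hard bound 2: runaway guard keyed to work, not time.
  if (ctx.workSinceIncident > cfg.maxRecoveryWork) {
    return { seal: true, reason: "work_budget_exceeded" };
  }
  // Caller predicate: consulted from the second attempt onward, and only
  // when no hard bound has already sealed the incident.
  if (ctx.attempt >= 2 && cfg.shouldKeepRecovering) {
    try {
      if (!cfg.shouldKeepRecovering(ctx)) {
        return { seal: true, reason: "recovery_aborted" };
      }
    } catch (err) {
      // A throwing predicate is logged and treated as "keep recovering".
      console.warn("shouldKeepRecovering threw; continuing recovery", err);
    }
  }
  return { seal: false };
}

// With the shipped defaults, a progressing turn is never sealed by age or work.
const shippedDefaults: BudgetConfigSketch = {
  noProgressTimeoutMs: 5 * 60 * 1000,
  maxRecoveryWork: Infinity
};
console.log(
  decideRecovery({ attempt: 40, workSinceIncident: 1000000, msSinceProgress: 2000 }, shippedDefaults)
);
```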

    New public types from agents/chat: ChatRecoveryProgressContext. New ChatRecoveryConfig fields: maxRecoveryWork, shouldKeepRecovering, noProgressTimeoutMs. ChatRecoveryExhaustedContext.reason gains work_budget_exceeded and recovery_aborted; max_recovery_window_exceeded is retained as an open-string value but is no longer emitted.

    Both @cloudflare/ai-chat and @cloudflare/think (which carries its own copy of the recovery engine) are updated identically. Defaults are unchanged except that a progressing turn is no longer terminated by wall-clock age.

  • #1670 5d64940 Thanks @threepointone! - Fix: a deploy that interrupts an in-flight runAgentTool child no longer abandons the still-running child as interrupted.

    Parent recovery re-attaches to a still-running child and tails it to its real terminal. Previously that re-attach used a flat 120s wall-clock budget that was not reset by the child's forward progress, so a healthy child whose recovery legitimately ran longer than the budget was sealed interrupted (and its already-completed work re-run from scratch), even while it was actively streaming.

    The re-attach budget is now progress-keyed: it bounds how long the parent waits with no forward progress from the child (resetting on every forwarded chunk), so a genuinely hung/silent child still seals interrupted after one no-progress window and can never block recovery forever, while a healthy child that keeps streaming is followed through to terminal. The parent re-arms (opens a fresh tail) only when the child's stream closes cleanly while it is still advancing — i.e. a re-evicted-but-progressing child. A full no-progress window (the child went silent) seals no-progress immediately even if the child streamed earlier in that window; it no longer grants a bonus window. This is both the honest stall signal and what keeps at most one pending tail reader alive per re-attach (no per-cycle reader accumulation).
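
    The difference between a flat wall-clock budget and a progress-keyed one can be shown with a small pure function over chunk arrival times. The function name and its parameters are hypothetical and the real tail reader is event-driven; only the 120000 ms default comes from this release.

```typescript
// Returns the time at which a re-attached parent would seal the child as
// "no-progress", or null if the stream closed before any silent gap
// reached the budget. All times are milliseconds on one shared clock.
function firstNoProgressSealAt(
  attachedAtMs: number,
  chunkTimesMs: number[],
  streamClosedAtMs: number,
  noProgressTimeoutMs: number
): number | null {
  let lastProgressMs = attachedAtMs;
  for (const eventMs of [...chunkTimesMs, streamClosedAtMs]) {
    // The budget is measured from the last forwarded chunk, not from attach time.
    if (eventMs - lastProgressMs >= noProgressTimeoutMs) {
      return lastProgressMs + noProgressTimeoutMs;
    }
    lastProgressMs = eventMs;
  }
  return null;
}

const budgetMs = 120000;

// Healthy child: a chunk every 60 s for 10 minutes, then a clean close.
const steadyChunks = Array.from({ length: 10 }, (_, i) => (i + 1) * 60000);
console.log(firstNoProgressSealAt(0, steadyChunks, 630000, budgetMs)); // null

// Child that streamed early and then went silent.
console.log(firstNoProgressSealAt(0, [30000, 60000], 400000, budgetMs)); // 180000
```

    Under the previous flat budget, the healthy child in the first call would have been sealed 120 s after attach even though it never stopped streaming.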

    @cloudflare/think and @cloudflare/ai-chat additionally finalize a child facet's own agent-tool run row as soon as its recovered turn settles — regardless of whether recovery took the continue path (_chatRecoveryContinue) or the pre-stream retry path (_chatRecoveryRetry) — so a re-attached parent collects the terminal result immediately instead of waiting out a full no-progress window after the child has already finished.

    This release also adds:

    • Typed interrupted cause. RunAgentToolResult, the agentTool() AgentToolFailure envelope, the onAgentToolFinish lifecycle result, and the agent-tool-event wire event (kind "interrupted") now carry a machine-readable reason (AgentToolInterruptedReason: "no-progress" | "window-exceeded" | "not-tailable" | "inspect-timeout" | "inspect-failed" | "recovery-deadline") and a childStillRunning boolean on interrupted results, so callers (and UIs) can branch on why a run was abandoned (and whether the child is still running) instead of pattern-matching the human-readable error prose. retryable stays coarse (always true for interrupted); refine with reason / childStillRunning. These fields are persisted (schema bump), so they survive a reconnect replay — a client that reconnects after an interrupt reconstructs the same reason / childStillRunning a live client saw, rather than undefined. The persisted cause is cleared when a soft interrupted row is later repaired to completed/error.
    • Configurable re-attach budgets. Two new public AgentStaticOptions, agentToolReattachNoProgressTimeoutMs (default 120000, the progress-keyed no-progress budget) and agentToolReattachMaxWindowMs (default Infinity, i.e. no implicit wall-clock cap), let an Agent tune re-attach. The hard ceiling defaults to uncapped to mirror chat-recovery's maxRecoveryWork: Infinity. A re-attached parent follows a healthy, still-advancing child for as long as it makes progress, exactly as it would on the live (never-evicted) path, so it never abandons a long-running-but-healthy child that simply outlasts a fixed wall clock under deploy churn. A hung/silent child is bounded by the no-progress budget; a content-runaway is bounded uniformly (live and recovery) by the child's own maxRecoveryWork / shouldKeepRecovering. Integrators that want a hard wall-clock cap (and the window-exceeded child teardown it triggers) can set agentToolReattachMaxWindowMs to a finite value. Symmetrically, setting agentToolReattachNoProgressTimeoutMs to Infinity now means "never seal on no-progress" (a silent-but-alive child is followed until its stream closes or the hard ceiling fires) instead of silently skipping the wait; 0 remains the "don't wait, collect only an already-terminal child" sentinel.
    • Give-up teardown (ceiling only). When the parent gives up at the hard window-exceeded ceiling — where the child has had its full recovery window and is truly exhausted — it now cancels the child (childStillRunning: false) so it stops consuming a fiber / keep-alive. no-progress give-ups stay soft (childStillRunning: true): the child is left running so a re-issue can still re-attach and repair it if it self-heals, preserving the repair-on-re-issue path. In both @cloudflare/think and @cloudflare/ai-chat, cancelAgentToolRun also aborts an in-flight chat-recovery turn (not just the original in-isolate run) and releases live tails — Think sweeps its _submissionAbortControllers, ai-chat its request AbortRegistry (abortAllRequests) — so a torn-down child stops grinding instead of finishing an orphaned recovered turn.
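
    As an illustration of branching on the typed cause instead of parsing error prose, a caller might do something like the following. The reason union is copied from this release; the InterruptedResultSketch shape, the optional fields, and the follow-up policy names are hypothetical.

```typescript
type AgentToolInterruptedReason =
  | "no-progress"
  | "window-exceeded"
  | "not-tailable"
  | "inspect-timeout"
  | "inspect-failed"
  | "recovery-deadline";

// Hypothetical minimal view of an interrupted result; the real
// RunAgentToolResult carries more fields.
type InterruptedResultSketch = {
  status: "interrupted";
  retryable: true; // stays coarse: always true for interrupted
  reason?: AgentToolInterruptedReason;
  childStillRunning?: boolean;
};

// Hypothetical caller-side policy, not something the SDK prescribes.
type FollowUp = "reissue-and-reattach" | "start-fresh-run" | "retry";

function followUpFor(result: InterruptedResultSketch): FollowUp {
  // Soft give-up: the child was left running, so a re-issue can re-attach to it.
  if (result.childStillRunning) return "reissue-and-reattach";
  // Hard ceiling: the parent cancelled the child, so there is nothing to re-attach to.
  if (result.reason === "window-exceeded") return "start-fresh-run";
  return "retry";
}

console.log(
  followUpFor({ status: "interrupted", retryable: true, reason: "no-progress", childStillRunning: true })
); // reissue-and-reattach
console.log(
  followUpFor({ status: "interrupted", retryable: true, reason: "window-exceeded", childStillRunning: false })
); // start-fresh-run
```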
