github cloudflare/agents @cloudflare/ai-chat@0.8.0

7 hours ago

Minor Changes

  • #1636 f5a0d00 Thanks @threepointone! - Expose recovery incident identity and enrich the onExhausted payload so
    products can build a terminal-state policy without re-deriving anything (#1631).

    • ChatRecoveryContext (the onChatRecovery argument) now includes
      recoveryRootRequestId — the stable request ID for the whole continuation
      chain. Unlike requestId, it doesn't change across chained continuations, so
      it's the right key for per-incident budget tracking / fresh-incident detection
      without re-deriving identity from message IDs.
    • ChatRecoveryExhaustedContext (the onExhausted argument) now carries
      recoveryRootRequestId, terminalMessage (the exact text shown to the user),
      partialText / partialParts (what the turn produced before it was given up
      on), and streamId / createdAt — enough to render or persist a user-facing
      terminal banner AND emit correlated terminal telemetry (e.g. time-since-turn-start,
      stream correlation) directly, without re-deriving anything.

    All fields are additive. Applied across agents (shared types),
    @cloudflare/think, and @cloudflare/ai-chat.

  • #1648 d6827ab Thanks @threepointone! - Surface a live "recovering…" status to chat clients during durable recovery (#1620)

    When a durable chat turn is interrupted (a deploy/eviction, or a stream-stall
    watchdog abort) and resumes, clients had no "in progress" signal — the turn
    looked frozen until it completed or a terminal error was replayed. A new
    cf_agent_chat_recovering protocol frame is now broadcast on recovery schedule
    and cleared on every terminal outcome (completed/skipped/failed/exhausted), so
    the indicator can't spin forever. In @cloudflare/think it's also persisted and
    replayed on connect, so a client that joins mid-recovery learns the turn is
    working. useAgentChat exposes a new isRecovering flag (distinct from
    isStreaming — a recovering turn isn't producing tokens yet); most UIs render
    isStreaming || isRecovering as "busy". Backward-compatible: clients that don't
    understand the frame ignore it.

    Note: @cloudflare/ai-chat broadcasts the live signal but does not yet replay
    it on connect (it has no idle-connect hydration path; tracked in #1645).
    @cloudflare/think has both.

    For recovery telemetry, subscribe to the chat:recovery:* observability events
    and route them to your analytics sink.

  • #1611 02f9380 Thanks @threepointone! - Add bounded, observable recovery foundations for durable chat turns and fibers.

    • Add dedicated recovery observability channels/events for fibers, chat recovery, transcript repair, and agent-tool recovery.
    • Bound internal framework fiber recovery hooks and parent agent-tool recovery scans so startup and recovery work cannot wedge indefinitely.
    • Add shared chat recovery incident tracking with attempt counts, configurable chatRecovery defaults, and terminal exhaustion behavior for AIChatAgent and Think. Think recovery now exhausts after six failed attempts by default and sends a terminal error frame instead of spinning indefinitely.
    • Keep the recovery attempt budget bounded even when an interrupted turn flips between retry and continue recovery kinds (the incident identity no longer includes the kind), guard a throwing onExhausted hook so the terminal UX is still delivered, mark incidents failed when the recovery dispatch throws, and reclaim incident records on success plus a TTL sweep for abandoned ones so durable storage does not grow without bound.
    • Bound generic unmanaged fiber recovery with a configurable fiberRecoveryMaxAgeMs so a repeatedly-throwing onFiberRecovered() hook cannot re-trigger forever across restarts.
    • Surface Think post-persist chat request failures through onChatError(error, ctx) and chat:request:failed.
    • Repair incomplete Think tool-call transcripts before provider calls and allow createCompactFunction() to use a supplied token counter for tail budgeting.
  • #1638 b6c8dea Thanks @threepointone! - Make chat recovery's budget wall-clock-keyed-to-progress instead of raw attempt
    count, so a healthy turn under deploy churn isn't sealed prematurely (#1637).

    Under continuous deploys the attempt count is the wrong primary bound: one
    rollout drops/reconnects the socket several times (~11–22s), each firing a
    recovery alarm, so the count inflated far faster than the real interruption rate
    and exhausted turns that were still advancing (0/23 model calls errored in the
    reported incident — it was pure eviction churn).

    Now:

    • Primary bound: a 5-minute no-progress wall clock keyed to lastProgressAt,
      which resets on every progress-bearing attempt. A turn that keeps producing
      content survives churn indefinitely; one that genuinely goes quiet is sealed
      within 5 minutes.
    • Alarm debounce (~30s): recovery alarms bunched within the window (a single
      rollout's reconnect storm) collapse into one attempt.
    • Attempt cap is now a high secondary backstop (default raised 6 → 10),
      resets on progress; it only catches a pathological tight alarm-loop.
    • The existing 15-minute absolute incident-age ceiling is kept as the final
      non-resetting hard stop.
    • Progress signal moved to production time (when new content is durably
      flushed/streamed) instead of persist time — so it advances only on genuinely
      new content and is immune to client reconnects and recovery re-persists, which
      the no-progress window depends on. (Builds on the compaction-immune counter
      from #1628.)

    Applies to both @cloudflare/think and @cloudflare/ai-chat, including the
    TaskSubAgent/sub-agent recovery path.

Patch Changes

  • #1634 a4225fd Thanks @threepointone! - Stop chat recovery from discarding settled work when a turn is given up on
    (#1631).

    Two paths could throw away a partial assistant message containing completed,
    often non-idempotent tool results:

    • When the framework's own recovery budget was exhausted, _exhaustChatRecovery
      sealed the turn (terminal status + banner) before the orphaned stream was
      ever persisted — so every settled tool result the turn had produced was lost
      and the model re-ran them on the next message. Exhaustion now persists the
      settled partial first, using the same gating as the normal recovery path so it
      can't duplicate an already-saved partial.
    • A subclass onChatRecovery returning { persist: false } to stop a turn used
      to silently drop the settled partial. Settled work is now never dropped:
      persist: false only suppresses persistence of a partial that has nothing
      settled to lose; a partial carrying settled tool results is persisted
      regardless. An app can no longer accidentally discard completed work — and it
      never needs { persist: true } just to stay safe. (A safe default beats a
      warning about an unsafe one.)

    Applied identically to @cloudflare/think and @cloudflare/ai-chat.

  • #1633 1aca578 Thanks @threepointone! - Fix chat recovery prematurely exhausting its retry budget under compaction
    (#1628). The deploy-churn forward-progress signal — which resets the recovery
    budget when an interrupted turn is actually advancing — was recomputed from the
    live transcript by counting assistant messages. Compaction collapses older
    assistant messages into a summary, lowering that count, so a turn that had
    genuinely advanced could read as "no progress" between recovery attempts and
    exhaust at maxAttempts, sealing a healthy turn. Progress is now tracked by a
    durable, monotonic counter incremented when _persistOrphanedStream materializes
    a non-empty partial (the exact event the message count was proxying for), so
    compaction can never lower it. A turn that genuinely fails to advance still
    exhausts at the cap, and the 15-minute wall-clock ceiling is unchanged.

  • #1615 51a771f Thanks @threepointone! - Chat recovery no longer permanently abandons a turn under repeated deploys. A
    mid-turn deploy resets the Durable Object ("code was updated") and the
    interrupted continuation is re-detected on the next wake; previously every such
    interruption consumed one of the bounded recovery attempts, so a deploy every
    few minutes exhausted the budget (max_attempts_exceeded) and the turn was
    terminally abandoned even though each fresh isolate was healthy. Recovery now
    distinguishes an interruption that followed forward progress (more persisted
    assistant content than the previous attempt observed) — treated as environmental
    and not counted against the budget — from a turn that never advances, which still
    exhausts at maxAttempts. A 15-minute wall-clock ceiling per incident bounds the
    worst case so a continuously churning environment cannot retry forever.

  • #1608 7c17736 Thanks @cjol! - Fix auto-continuation stream resumes so immediate client-tool resume requests attach to the pending continuation instead of receiving cf_agent_stream_resume_none.

  • #1651 d118d11 Thanks @threepointone! - Fix auto-continuation firing before all parallel client-tool results arrive
    (#1649). When the model emitted multiple tool calls in one step and the client
    resolved them independently via addToolOutput, a fast result's autoContinue
    could trigger the next inference while a slower sibling was still
    input-available. That fed the provider an incomplete tool-result set
    (MissingToolResultsError) or — via the transcript-repair backstop — silently
    flipped the in-flight sibling to errored and ran a spurious extra continuation.

    Auto-continuation now waits until the transcript is stable (no
    input-available/approval-requested parts) before continuing, so a fanned-out
    tool batch coalesces into a single continuation regardless of result arrival
    order. The wait is bounded, so a genuinely orphaned tool call (e.g. the client
    disconnected mid-batch) still falls through to the existing backstop instead of
    pinning the continuation open.

  • #1641 3aa1936 Thanks @threepointone! - Count a sub-agent's progress as the orchestrating parent's recovery progress

    A parent turn whose work is "run a sub-agent and await its result" produced no
    recoverable content of its own, so under deploy churn the parent's own
    chat-recovery no-progress window could exhaust while the child was still
    healthily streaming — abandoning the turn as interrupted and collecting an
    interrupted result even though the child went on to complete. (Reproduced by
    the examples/deploy-churn --mode subagent harness: the parent exhausted at
    attempt 6/6 with progress: 1 while the child self-healed all 30 steps.)

    Forwarding a child's stream to the parent's connections is now treated as
    genuine forward progress for the parent's recovery budget: Think and
    AIChatAgent advance their durable recovery-progress marker (throttled) each
    time _forwardAgentToolStream forwards child output, so a parent that keeps
    re-attaching to and streaming a live child survives churn indefinitely. The
    credit is only granted when the child actually produces output — a silent or
    hung child still lets the parent exhaust on its own no-progress timer, so a
    stuck sub-agent can never pin a parent's recovery open forever.

    This completes the sub-agent recovery story started by the stable-runId +
    bounded re-attach fix (#1630): the child self-heals and the parent both
    re-attaches to it and keeps its own recovery alive while doing so.

  • #1646 a245a4a Thanks @threepointone! - Terminalize a chat-recovery turn through onExhausted when it gives up waiting for stable state

    Under extreme churn (a long turn interrupted many times in quick succession), a
    recovery callback (_chatRecoveryRetry / _chatRecoveryContinue) could keep
    timing out waiting for the isolate to reach stable state until its retry budget
    drained. The give-up path only marked the incident failed and completed the
    recovered submission as error — it bypassed _exhaustChatRecovery, so
    onExhausted never fired, the chat:recovery:exhausted event was not emitted,
    the configured terminalMessage banner was never delivered, and the terminal
    chat status was not recorded. Apps relying on onExhausted for the terminal
    banner saw an eternal spinner with no terminal signal.

    The stable-state-timeout give-up now routes through the same
    _exhaustChatRecovery path as deploy-recovery and stall exhaustion: it fires
    onExhausted (with reason: "stable_timeout"), emits chat:recovery:exhausted,
    marks the durable submission interrupted, records the terminal chat status, and
    delivers the terminalMessage. As an extra backstop against silent drops, the
    give-up also terminalizes when the incident record is missing (no incidentId,
    or it was swept/deleted before a stale alarm fired) by synthesizing a terminal
    incident from the recovery-root request id — so a turn can never be dropped with
    no terminal UX.

  • #1623 4c8b371 Thanks @threepointone! - Fix chat recovery falsely marking a durable submission as error under repeated mid-turn deploys.

    When several deploys interrupt a single turn, recovery runs a chain of continuations. Three bugs combined to leave the submission in error even when the turn actually completed every step:

    • Lost ownership. The submission link (recoveredRequestId) was derived from each continuation's own (fresh) requestId, so chained continuations dropped it — the continuation that finally completed the turn could no longer mark the submission completed.
    • Stale-continuation clobber. A superseded continuation tripped the conversation_changed guard because the leaf had advanced via recovery's own forward progress (a new assistant message), not a new user turn, and overwrote the still-running submission to error.
    • Premature stable_timeout. A timeout while waiting for the isolate to settle (common while a deploy is in flight) failed the turn terminally at the first attempt.

    Now: submission ownership is keyed off the stable recovery root and threaded through the entire continuation chain (including the terminal abandon paths — recovery exhaustion and { continue: false } — which previously marked the submission by the per-continuation requestId and so left a chained submission stuck running); a superseded continuation skips benignly (only a genuinely newer user turn marks the submission skipped, never error); and a stable-state timeout reschedules within the maxAttempts budget. A turn that completes under deploy churn now ends completed, not error.

    @cloudflare/ai-chat has the same recovery machinery but no durable-submission layer, so it receives the stable_timeout reschedule fix only: a transient stable-state timeout now retries within the attempt budget instead of permanently abandoning a recoverable turn at the first attempt.

  • #1606 7419fbc Thanks @threepointone! - Serialize client-tool continuation resumes so they do not overlap the active AI SDK chat request.

  • #1640 edb126a Thanks @threepointone! - Re-attach to a still-running sub-agent (agentTool()) run on parent recovery instead of abandoning and re-running it (#1630).

    When a parent agent was interrupted (deploy / Durable Object eviction) while a child agentTool() run was still in flight, recovery marked the run interrupted within a ~5s window and the parent re-issued the task — re-running the child's already-completed work. For long-running children under continuous deploys this surfaced to users as "the agent went all the way back and lost the files it already wrote."

    Three changes fix this:

    • Stable child runId. agentTool() now derives the child runId from the (recovery-preserved) tool call id (agent-tool:<toolCallId>) instead of minting a fresh nanoid per call. A turn re-run by chat recovery now resolves to the same idempotent child facet rather than spawning a brand-new one, so completed child work is never re-run.
    • Bounded re-attach. A duplicate non-terminal runId (in runAgentTool) and a still-running child during startup reconciliation now tail the live child to its real terminal result and collect it, instead of immediately sealing interrupted. Re-attach is bounded by a generous wall-clock budget (DEFAULT_AGENT_TOOL_REATTACH_TIMEOUT_MS, 120s, internal): a child that keeps advancing toward terminal within the window is collected; a genuinely hung child still seals interrupted so recovery can never block forever.
    • Durable child-run reconcile. A child facet self-heals its interrupted turn via its own chatRecovery, but that recovery path never wrote the child's agent-tool run row — so after a real eviction the row stranded running (think) / was force-errored (ai-chat) and the parent could never collect the recovered result. Both @cloudflare/think and @cloudflare/ai-chat now reconcile a stale child-run row from the durable transcript on inspect: while recovery is still resolving the row stays running; once it settles, a completed assistant response surfaces as completed (so the parent collects the real result) and an empty/failed recovery as error. This keeps the child's own (working) recovery path untouched.

    No new public configuration. Adds an internal agent_tool:recovery:reattach observability event. @cloudflare/think and @cloudflare/ai-chat child tails are now read-only on consumer detach (a parent's re-attach budget expiring never cancels the still-running child).

Don't miss a new agents release

NewReleases is sending notifications on new releases.