github cloudflare/agents @cloudflare/think@0.8.0

7 hours ago

Minor Changes

  • #1636 f5a0d00 Thanks @threepointone! - Expose recovery incident identity and enrich the onExhausted payload so
    products can build a terminal-state policy without re-deriving anything (#1631).

    • ChatRecoveryContext (the onChatRecovery argument) now includes
      recoveryRootRequestId — the stable request ID for the whole continuation
      chain. Unlike requestId, it doesn't change across chained continuations, so
      it's the right key for per-incident budget tracking / fresh-incident detection
      without re-deriving identity from message IDs.
    • ChatRecoveryExhaustedContext (the onExhausted argument) now carries
      recoveryRootRequestId, terminalMessage (the exact text shown to the user),
      partialText / partialParts (what the turn produced before it was given up
      on), and streamId / createdAt — enough to render or persist a user-facing
      terminal banner AND emit correlated terminal telemetry (e.g. time-since-turn-start,
      stream correlation) directly, without re-deriving anything.

    All fields are additive. Applied across agents (shared types),
    @cloudflare/think, and @cloudflare/ai-chat.
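
As a rough illustration of the per-incident tracking this enables, the sketch below keys an attempt counter on recoveryRootRequestId. The RecoveryContextLike type and the IncidentTracker class are hypothetical stand-ins written for this example; only the two field names come from the notes above.

```typescript
// Hypothetical subset of ChatRecoveryContext: only the two fields named in
// the release notes are modeled here.
interface RecoveryContextLike {
  requestId: string;
  recoveryRootRequestId: string;
}

// Hypothetical per-incident tracker, keyed by the stable root request ID.
class IncidentTracker {
  private attempts = new Map<string, number>();

  // Records one recovery attempt and reports whether the incident is new.
  record(ctx: RecoveryContextLike): { fresh: boolean; attempts: number } {
    const previous = this.attempts.get(ctx.recoveryRootRequestId) ?? 0;
    const attempts = previous + 1;
    this.attempts.set(ctx.recoveryRootRequestId, attempts);
    return { fresh: previous === 0, attempts };
  }
}

// Two chained continuations of one incident, then an unrelated incident.
const tracker = new IncidentTracker();
const first = tracker.record({ requestId: "req-a", recoveryRootRequestId: "root-1" });
const second = tracker.record({ requestId: "req-b", recoveryRootRequestId: "root-1" });
const other = tracker.record({ requestId: "req-c", recoveryRootRequestId: "root-2" });
```

Keying on requestId instead would treat every continuation as a new incident, which is the re-derivation problem the new field is meant to remove.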

  • #1584 87006e2 Thanks @threepointone! - Add a framework-agnostic Agent Skills engine at agents/skills: skill sources (fromManifest, R2), a SkillRegistry that produces a catalog prompt and AI SDK activation tools (activate_skill, read_skill_resource, run_skill_script), binary-safe resource reads, and qualified cross-skill resource paths. Bundled skills are imported through the Agents Vite plugin with the agents:skills specifier (defaulting to a ./skills directory), typed via ambient declarations shipped from agents. @cloudflare/think re-exports the engine as skills and wires getSkills() into the turn; any AI SDK caller (including @cloudflare/ai-chat) can build a SkillRegistry directly.

    Skill loading is resilient: duplicate or failing sources are skipped with a warning (first source wins) instead of throwing. Optional, experimental script execution (skills.runner) runs function-style JavaScript/TypeScript (export default run(input, ctx) with ctx = { skill, files, workspace, tools, output }) plus path-based Python and Bash, all behind a single capability and permission bridge.

  • #1648 d6827ab Thanks @threepointone! - Surface a live "recovering…" status to chat clients during durable recovery (#1620)

    When a durable chat turn is interrupted (a deploy/eviction, or a stream-stall
    watchdog abort) and resumes, clients had no "in progress" signal — the turn
    looked frozen until it completed or a terminal error was replayed. A new
    cf_agent_chat_recovering protocol frame is now broadcast on recovery schedule
    and cleared on every terminal outcome (completed/skipped/failed/exhausted), so
    the indicator can't spin forever. In @cloudflare/think it's also persisted and
    replayed on connect, so a client that joins mid-recovery learns the turn is
    working. useAgentChat exposes a new isRecovering flag (distinct from
    isStreaming — a recovering turn isn't producing tokens yet); most UIs render
    isStreaming || isRecovering as "busy". Backward-compatible: clients that don't
    understand the frame ignore it.

    Note: @cloudflare/ai-chat broadcasts the live signal but does not yet replay
    it on connect (it has no idle-connect hydration path; tracked in #1645).
    @cloudflare/think has both.

    For recovery telemetry, subscribe to the chat:recovery:* observability events
    and route them to your analytics sink.
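
A minimal sketch of how a UI might fold the two flags into one status. The ChatFlags type and both helpers are hypothetical; in a real client the flags would come from useAgentChat, which this standalone example does not import.

```typescript
// Hypothetical view of the two flags exposed to the UI.
interface ChatFlags {
  isStreaming: boolean;
  isRecovering: boolean;
}

type ChatStatus = "idle" | "streaming" | "recovering";

// Streaming wins for the label; a recovering turn is not producing tokens yet.
function chatStatus(flags: ChatFlags): ChatStatus {
  if (flags.isStreaming) return "streaming";
  if (flags.isRecovering) return "recovering";
  return "idle";
}

// Either flag counts as "busy", e.g. for disabling the send button.
function isBusy(flags: ChatFlags): boolean {
  return flags.isStreaming || flags.isRecovering;
}

const duringRecovery = chatStatus({ isStreaming: false, isRecovering: true });
const busyDuringRecovery = isBusy({ isStreaming: false, isRecovering: true });
const idle = chatStatus({ isStreaming: false, isRecovering: false });
```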

  • #1611 02f9380 Thanks @threepointone! - Add bounded, observable recovery foundations for durable chat turns and fibers.

    • Add dedicated recovery observability channels/events for fibers, chat recovery, transcript repair, and agent-tool recovery.
    • Bound internal framework fiber recovery hooks and parent agent-tool recovery scans so startup and recovery work cannot wedge indefinitely.
    • Add shared chat recovery incident tracking with attempt counts, configurable chatRecovery defaults, and terminal exhaustion behavior for AIChatAgent and Think. Think recovery now exhausts after six failed attempts by default and sends a terminal error frame instead of spinning indefinitely.
    • Keep the recovery attempt budget bounded even when an interrupted turn flips between retry and continue recovery kinds (the incident identity no longer includes the kind), guard a throwing onExhausted hook so the terminal UX is still delivered, mark incidents failed when the recovery dispatch throws, and reclaim incident records on success plus a TTL sweep for abandoned ones so durable storage does not grow without bound.
    • Bound generic unmanaged fiber recovery with a configurable fiberRecoveryMaxAgeMs so a repeatedly-throwing onFiberRecovered() hook cannot re-trigger forever across restarts.
    • Surface Think post-persist chat request failures through onChatError(error, ctx) and chat:request:failed.
    • Repair incomplete Think tool-call transcripts before provider calls and allow createCompactFunction() to use a supplied token counter for tail budgeting.

  • #1638 b6c8dea Thanks @threepointone! - Key chat recovery's budget to a no-progress wall clock instead of a raw attempt
    count, so a healthy turn under deploy churn isn't sealed prematurely (#1637).

    Under continuous deploys the attempt count is the wrong primary bound: one
    rollout drops/reconnects the socket several times (~11–22s), each firing a
    recovery alarm, so the count inflated far faster than the real interruption rate
    and exhausted turns that were still advancing (0/23 model calls errored in the
    reported incident — it was pure eviction churn).

    Now:

    • Primary bound: a 5-minute no-progress wall clock keyed to lastProgressAt,
      which resets on every progress-bearing attempt. A turn that keeps producing
      content survives churn indefinitely; one that genuinely goes quiet is sealed
      within 5 minutes.
    • Alarm debounce (~30s): recovery alarms bunched within the window (a single
      rollout's reconnect storm) collapse into one attempt.
    • Attempt cap is now a high secondary backstop (default raised 6 → 10),
      resets on progress; it only catches a pathological tight alarm-loop.
    • The existing 15-minute absolute incident-age ceiling is kept as the final
      non-resetting hard stop.
    • Progress signal moved to production time (when new content is durably
      flushed/streamed) instead of persist time — so it advances only on genuinely
      new content and is immune to client reconnects and recovery re-persists, which
      the no-progress window depends on. (Builds on the compaction-immune counter
      from #1628.)

    Applies to both @cloudflare/think and @cloudflare/ai-chat, including the
    TaskSubAgent/sub-agent recovery path.
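
The layered bounds above can be modeled roughly as follows. This is an illustrative model, not the framework's code: the Incident shape, the function names, and the order in which the bounds are checked are assumptions. Only the numeric defaults and lastProgressAt come from the notes.

```typescript
// Defaults quoted in the notes above.
const NO_PROGRESS_WINDOW_MS = 5 * 60_000;
const ALARM_DEBOUNCE_MS = 30_000;
const MAX_ATTEMPTS = 10;
const INCIDENT_MAX_AGE_MS = 15 * 60_000;

// Hypothetical incident record; fields other than lastProgressAt are invented.
interface Incident {
  startedAt: number;
  lastProgressAt: number;
  lastAttemptAt: number | null;
  attemptsSinceProgress: number;
}

type Decision = "attempt" | "debounced" | "exhausted";

// Decides what a recovery alarm firing at `now` should do.
function onRecoveryAlarm(incident: Incident, now: number): Decision {
  // Final, non-resetting hard stop.
  if (now - incident.startedAt >= INCIDENT_MAX_AGE_MS) return "exhausted";
  // Primary bound: no progress for the whole window.
  if (now - incident.lastProgressAt >= NO_PROGRESS_WINDOW_MS) return "exhausted";
  // Alarms bunched within the debounce window collapse into one attempt.
  if (incident.lastAttemptAt !== null && now - incident.lastAttemptAt < ALARM_DEBOUNCE_MS) {
    return "debounced";
  }
  // Secondary backstop against a tight alarm loop.
  if (incident.attemptsSinceProgress >= MAX_ATTEMPTS) return "exhausted";
  incident.attemptsSinceProgress += 1;
  incident.lastAttemptAt = now;
  return "attempt";
}

// Called when new content is durably flushed: resets both resettable bounds.
function onProgress(incident: Incident, now: number): void {
  incident.lastProgressAt = now;
  incident.attemptsSinceProgress = 0;
}

const incident: Incident = { startedAt: 0, lastProgressAt: 0, lastAttemptAt: null, attemptsSinceProgress: 0 };
const d1 = onRecoveryAlarm(incident, 10_000);  // first alarm of a rollout
const d2 = onRecoveryAlarm(incident, 20_000);  // same reconnect storm
onProgress(incident, 200_000);                 // turn produced new content
const d3 = onRecoveryAlarm(incident, 450_000); // still inside the no-progress window
const d4 = onRecoveryAlarm(incident, 520_000); // quiet for more than 5 minutes
```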

  • #1587 32ea71e Thanks @threepointone! - Add first-class Think messengers with provider-neutral routing, durable Chat SDK state, streamed Think replies, action events, and a Telegram provider entrypoint.

    The messenger runtime depends directly on Chat SDK, supports provider-specific adapter names for multi-bot setups, and exposes the Telegram provider as both a named and default export.

  • #1643 bc86dce Thanks @threepointone! - Route a stream-stall watchdog abort into bounded recovery instead of a terminal error (#1626)

    When chatStreamStallTimeoutMs is set and the inactivity watchdog fires on a
    hung model/transport stream, the turn is no longer failed terminally. Because a
    stall is just another interruption — like a deploy or eviction — it is now
    routed into the same bounded chat-recovery path: the settled partial is
    preserved, a continuation is scheduled, and the turn resumes. A transient hang
    (the common case under deploy churn) recovers automatically; a persistently
    hanging provider still terminalizes once the recovery budget is exhausted (the
    watchdog's original "kill the infinite spinner" guarantee, now after bounded
    retries). Exhaustion goes through the same _exhaustChatRecovery path as
    deploy-recovery exhaustion, so your configured terminalMessage is delivered,
    onExhausted fires, and the chat:recovery:exhausted event is emitted — rather
    than leaking the raw "Chat stream stalled…" error.

    This is automatic whenever the watchdog is enabled and chatRecovery is on
    (the Think default) — no new configuration. Idempotency matches deploy
    recovery: settled tool results are durable and are not re-run, but a tool that
    was mid-execution when the stall fired re-runs on the continuation. With
    chatRecovery disabled, a stall stays terminal as before.

    Also adds a per-turn TurnConfig.chatStreamStallTimeoutMs override (returned
    from beforeTurn): a turn known to invoke a slow tool can raise or disable
    (0) the watchdog for that turn only, instead of permanently widening the
    instance-level window. It auto-resets after the turn.
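
One way to picture the per-turn override is sketched below. TurnConfigLike, SLOW_TOOLS, the tool name, and all three helpers are hypothetical and only model the precedence described above; the real value is returned from beforeTurn on a Think subclass.

```typescript
// Hypothetical subset of TurnConfig; only the field named in the notes.
interface TurnConfigLike {
  chatStreamStallTimeoutMs?: number;
}

// Hypothetical list of tools known to run longer than the instance window.
const SLOW_TOOLS = new Set(["render_video"]);

// What a beforeTurn implementation might return for a given turn.
function turnConfigFor(expectedTools: string[]): TurnConfigLike {
  const slow = expectedTools.some((name) => SLOW_TOOLS.has(name));
  return slow ? { chatStreamStallTimeoutMs: 600_000 } : {};
}

// A per-turn value wins over the instance default, including 0 (disabled).
function effectiveStallTimeoutMs(instanceDefaultMs: number, turn: TurnConfigLike): number {
  return turn.chatStreamStallTimeoutMs ?? instanceDefaultMs;
}

function watchdogEnabled(timeoutMs: number): boolean {
  return timeoutMs > 0;
}

const normalTurn = effectiveStallTimeoutMs(120_000, turnConfigFor(["search"]));
const slowTurn = effectiveStallTimeoutMs(120_000, turnConfigFor(["render_video"]));
const disabledTurn = effectiveStallTimeoutMs(120_000, { chatStreamStallTimeoutMs: 0 });
```

The nullish-coalescing operator matters here: using `||` would silently discard an explicit 0 and re-arm the watchdog.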

  • #1647 0b29be5 Thanks @threepointone! - Add StreamCallback.onInterrupted() so a chat()-driven turn interrupted by recovery isn't silently abandoned

    When a turn driven through chat(userMessage, callback) is interrupted and routed
    into bounded recovery (a stream-stall watchdog abort), the scheduled continuation
    runs in a later isolate invocation without the original callback — so neither
    onDone() nor onError() ever fires for that callback. Because the isolate is
    still alive, the RPC promise resolves cleanly, and a consumer that keys off the
    clean resolve mis-reads it as success: it finalizes whatever partial it had
    streamed. For the built-in messenger delivery this meant posting a truncated
    answer as final, while the real recovered answer was produced later and broadcast
    only to WebSocket connections.

    StreamCallback now has an optional onInterrupted?() signal, emitted from the
    stall→recovery branches of the RPC stream path instead of returning silently. It
    means "not done, not a terminal error — a continuation owns the final outcome";
    consumers should keep the channel open / show a recovering state / re-attach
    rather than finalizing the partial. It is optional, so existing
    StreamCallback implementers are unaffected.

    Messenger delivery is wired to it: an interrupted reply now surfaces an
    "interrupted, please retry" message instead of finalizing the truncated partial.

    Note: a deploy/eviction interruption kills the isolate (and the callback) before
    this can fire — the caller observes a transport break instead. onInterrupted
    covers the in-isolate stall→recovery path.
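
A consumer might handle the new signal roughly like this. StreamCallbackLike is an assumed shape: onDone, onError, and the optional onInterrupted are named in the notes, but their parameter lists are guesses, and the Delivery class is purely illustrative.

```typescript
// Assumed callback shape; parameter lists are guesses for this sketch.
interface StreamCallbackLike {
  onDone(): void;
  onError(error: string): void;
  onInterrupted?(): void;
}

type DeliveryState = "streaming" | "final" | "failed" | "awaiting-continuation";

// Hypothetical consumer that refuses to finalize a partial after an interruption.
class Delivery implements StreamCallbackLike {
  state: DeliveryState = "streaming";

  onDone(): void {
    this.state = "final";
  }

  onError(_error: string): void {
    this.state = "failed";
  }

  // Not done and not a terminal error: a continuation owns the outcome.
  onInterrupted(): void {
    this.state = "awaiting-continuation";
  }

  // Called when the RPC promise resolves. A clean resolve with no terminal
  // callback is not treated as success in this sketch.
  settle(): DeliveryState {
    if (this.state === "streaming") this.state = "awaiting-continuation";
    return this.state;
  }
}

const interrupted = new Delivery();
interrupted.onInterrupted();
const interruptedOutcome = interrupted.settle();

const completed = new Delivery();
completed.onDone();
const completedOutcome = completed.settle();
```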

  • #1598 f5e37bf Thanks @threepointone! - Add ThinkWorkflow with durable step.prompt() support for Workflow-owned Think reasoning steps.

  • #1635 5995fa8 Thanks @threepointone! - Add a repairInterruptedToolPart hook so subclasses can control how an
    interrupted tool call is repaired during transcript repair (#1631).

    Transcript repair flips a tool call with no settled result to an errored
    tool-result (preserving the record and keeping the provider from 400ing). That
    is the right default for server tools, but wrong for client-resolved tools like
    ask_user — a question with no server execute, answered by the user's next
    message — where the interrupted call is a question and should be preserved as
    text so the model sees normal Q→A conversation and compaction keeps the prompt
    verbatim. Because repair runs (and persists) before beforeTurn, a subclass had
    no way to shape this for the current turn.

    repairInterruptedToolPart(part) defaults to the existing errored-result
    behavior and runs during repair, so an override (e.g. converting an interrupted
    ask_user into a text part carrying the prompt) takes effect on the same turn,
    not just the next one.
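
An override along the lines described above might look like the following sketch. The part shapes, the "tool-ask_user" type string, the question input field, and the hook's return type are all assumptions; the real hook lives on a Think subclass and its signature may differ.

```typescript
// Assumed part shapes, loosely modeled on AI SDK UI message parts.
interface ToolPart {
  type: string;
  toolCallId: string;
  state: string;
  input?: unknown;
  errorText?: string;
}

interface TextPart {
  type: "text";
  text: string;
}

// Stand-in for the default behavior: flip to an errored tool result.
function defaultRepair(part: ToolPart): ToolPart {
  return { ...part, state: "output-error", errorText: "Tool call was interrupted." };
}

// Hypothetical override: keep an interrupted ask_user call as plain text.
function repairInterruptedToolPart(part: ToolPart): ToolPart | TextPart {
  if (part.type === "tool-ask_user") {
    const input = part.input as { question?: string } | undefined;
    return { type: "text", text: input?.question ?? "" };
  }
  return defaultRepair(part);
}

const question = repairInterruptedToolPart({
  type: "tool-ask_user",
  toolCallId: "call-1",
  state: "input-available",
  input: { question: "Which region should I deploy to?" },
});

const serverTool = repairInterruptedToolPart({
  type: "tool-run_query",
  toolCallId: "call-2",
  state: "input-available",
  input: { sql: "SELECT 1" },
});
```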

  • #1623 4c8b371 Thanks @threepointone! - Add an opt-in inactivity watchdog for the streaming read loop, so a hung provider/transport surfaces a terminal error instead of an infinite spinner.

    Previously, if a model stream parked without ever throwing — no chunk, no error, no done — the chat read loop would wait forever and the client would spin indefinitely. There was no detection for a silently hung turn (only recovery-path stable_timeout, which guards recovery scheduling, not a live stream).

    Set chatStreamStallTimeoutMs on a Think subclass to arm it: if no UI-message-stream chunk arrives within that window, the watchdog aborts the turn and the loop exits with a terminal stream error (routed through onChatError with stage: "stream"), emitting a new chat:stream:stalled observability event.

    It is off by default (0) and applies to both the WebSocket turn loop and the chat() / sub-agent callback loop. Note that it measures the gap between stream chunks, which includes server-side tool execution time (no chunks flow while a tool runs). Set it comfortably above your slowest model time-to-first-token and slowest tool, or you will abort healthy long turns. A good starting point is 120_000 (milliseconds, i.e. 2 minutes).
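
To make the "gap between chunks" semantics concrete, here is a small model of the watchdog with an injected clock. The StallWatchdog class is hypothetical and is not the framework's implementation; it only shows why a long tool run counts against the window.

```typescript
// Hypothetical inactivity watchdog; time is passed in so the logic is testable.
class StallWatchdog {
  private timeoutMs: number;
  private lastChunkAt: number;

  constructor(timeoutMs: number, startedAt: number) {
    this.timeoutMs = timeoutMs;
    this.lastChunkAt = startedAt;
  }

  // Any stream chunk resets the inactivity window.
  onChunk(now: number): void {
    this.lastChunkAt = now;
  }

  // A timeout of 0 means the watchdog is off.
  isStalled(now: number): boolean {
    return this.timeoutMs > 0 && now - this.lastChunkAt >= this.timeoutMs;
  }
}

// A tool that runs for 90 seconds emits no chunks while it executes.
const tight = new StallWatchdog(60_000, 0);
tight.onChunk(1_000);
const tightStalled = tight.isStalled(91_000);

const roomy = new StallWatchdog(120_000, 0);
roomy.onChunk(1_000);
const roomyStalled = roomy.isStalled(91_000);

const off = new StallWatchdog(0, 0);
const offStalled = off.isStalled(10_000_000);
```

With a 60-second window the healthy 90-second tool run is aborted, while the 120-second window lets it finish.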

  • #1585 8ad724b Thanks @threepointone! - Add declarative scheduled tasks for Think agents with a typed recurring DSL, timezone-aware reconciliation, and durable idempotent submissions.

Patch Changes

  • #1634 a4225fd Thanks @threepointone! - Stop chat recovery from discarding settled work when a turn is given up on
    (#1631).

    Two paths could throw away a partial assistant message containing completed,
    often non-idempotent tool results:

    • When the framework's own recovery budget was exhausted, _exhaustChatRecovery
      sealed the turn (terminal status + banner) before the orphaned stream was
      ever persisted — so every settled tool result the turn had produced was lost
      and the model re-ran them on the next message. Exhaustion now persists the
      settled partial first, using the same gating as the normal recovery path so it
      can't duplicate an already-saved partial.
    • A subclass onChatRecovery returning { persist: false } to stop a turn used
      to silently drop the settled partial. Settled work is now never dropped:
      persist: false only suppresses persistence of a partial that has nothing
      settled to lose; a partial carrying settled tool results is persisted
      regardless. An app can no longer accidentally discard completed work — and it
      never needs { persist: true } just to stay safe. (A safe default beats a
      warning about an unsafe one.)

    Applied identically to @cloudflare/think and @cloudflare/ai-chat.

  • #1633 1aca578 Thanks @threepointone! - Fix chat recovery prematurely exhausting its retry budget under compaction
    (#1628). The deploy-churn forward-progress signal — which resets the recovery
    budget when an interrupted turn is actually advancing — was recomputed from the
    live transcript by counting assistant messages. Compaction collapses older
    assistant messages into a summary, lowering that count, so a turn that had
    genuinely advanced could read as "no progress" between recovery attempts and
    exhaust at maxAttempts, sealing a healthy turn. Progress is now tracked by a
    durable, monotonic counter incremented when _persistOrphanedStream materializes
    a non-empty partial (the exact event the message count was proxying for), so
    compaction can never lower it. A turn that genuinely fails to advance still
    exhausts at the cap, and the 15-minute wall-clock ceiling is unchanged.

  • #1615 51a771f Thanks @threepointone! - Chat recovery no longer permanently abandons a turn under repeated deploys. A
    mid-turn deploy resets the Durable Object ("code was updated") and the
    interrupted continuation is re-detected on the next wake; previously every such
    interruption consumed one of the bounded recovery attempts, so a deploy every
    few minutes exhausted the budget (max_attempts_exceeded) and the turn was
    terminally abandoned even though each fresh isolate was healthy. Recovery now
    distinguishes an interruption that followed forward progress (more persisted
    assistant content than the previous attempt observed) — treated as environmental
    and not counted against the budget — from a turn that never advances, which still
    exhausts at maxAttempts. A 15-minute wall-clock ceiling per incident bounds the
    worst case so a continuously churning environment cannot retry forever.

  • #1618 e6b6c0b Thanks @threepointone! - Chat continuation no longer fails on models that reject assistant-prefill.
    Continuing a partial assistant turn (e.g. after a deploy interrupts a stream)
    replayed a transcript whose final message was that partial assistant message.
    Modern chat models reject a request ending in an assistant message — Anthropic
    Claude 4.6+ returns a 400 ("This model does not support assistant message
    prefill. The conversation must end with a user message.") — so the continuation
    threw and the turn was left interrupted. Think now appends an ephemeral user
    "continue" checkpoint whenever a model request would otherwise end in an
    assistant message, so continuation works across providers. The checkpoint
    shapes only the model request and is never persisted to the transcript.
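
The checkpoint idea can be sketched as a pure function over the outgoing request. The ModelMessage type, the function name, and the literal "continue" text are assumptions for illustration; the point is that the stored transcript is left untouched.

```typescript
type Role = "system" | "user" | "assistant";

// Simplified message shape for this sketch.
interface ModelMessage {
  role: Role;
  content: string;
}

// Returns the messages to send to the model. The input array is never
// mutated, so the checkpoint cannot leak into the persisted transcript.
function withContinueCheckpoint(transcript: readonly ModelMessage[]): ModelMessage[] {
  const last = transcript[transcript.length - 1];
  if (last?.role === "assistant") {
    return [...transcript, { role: "user", content: "continue" }];
  }
  return [...transcript];
}

const stored: ModelMessage[] = [
  { role: "user", content: "Summarize the report." },
  { role: "assistant", content: "The report covers three" },
];
const request = withContinueCheckpoint(stored);
```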

  • #1608 7c17736 Thanks @cjol! - Fix auto-continuation stream resumes so immediate client-tool resume requests attach to the pending continuation instead of receiving cf_agent_stream_resume_none.

  • #1651 d118d11 Thanks @threepointone! - Fix auto-continuation firing before all parallel client-tool results arrive
    (#1649). When the model emitted multiple tool calls in one step and the client
    resolved them independently via addToolOutput, a fast result's autoContinue
    could trigger the next inference while a slower sibling was still
    input-available. That fed the provider an incomplete tool-result set
    (MissingToolResultsError) or — via the transcript-repair backstop — silently
    flipped the in-flight sibling to errored and ran a spurious extra continuation.

    Auto-continuation now waits until the transcript is stable (no
    input-available/approval-requested parts) before continuing, so a fanned-out
    tool batch coalesces into a single continuation regardless of result arrival
    order. The wait is bounded, so a genuinely orphaned tool call (e.g. the client
    disconnected mid-batch) still falls through to the existing backstop instead of
    pinning the continuation open.
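
The stability check can be pictured as a predicate over tool-part states plus a bounded wait. The ToolPartLike type, the nextStep helper, and the maxWaitMs parameter are hypothetical; the notes say only that the wait is bounded, not what the bound is.

```typescript
// Minimal tool-part shape for this sketch.
interface ToolPartLike {
  toolCallId: string;
  state: string;
}

// States that mean a sibling result is still outstanding.
const PENDING_STATES = new Set(["input-available", "approval-requested"]);

function isTranscriptStable(parts: ToolPartLike[]): boolean {
  return parts.every((part) => !PENDING_STATES.has(part.state));
}

type Next = "continue" | "wait" | "fall-through";

// Continue once stable; otherwise wait up to a (hypothetical) bound, then
// fall through to the existing backstop.
function nextStep(parts: ToolPartLike[], waitedMs: number, maxWaitMs: number): Next {
  if (isTranscriptStable(parts)) return "continue";
  return waitedMs < maxWaitMs ? "wait" : "fall-through";
}

const batch: ToolPartLike[] = [
  { toolCallId: "fast", state: "output-available" },
  { toolCallId: "slow", state: "input-available" },
];
const whileSlowPending = nextStep(batch, 100, 5_000);
batch[1].state = "output-available";
const afterSlowResolved = nextStep(batch, 200, 5_000);
```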

  • #1629 7d38363 Thanks @whoiskatrin! - Fix server-side needsApproval tool continuations remaining stuck after the
    user approves them. Think now keeps approved/denied/errored tool parts in the
    model transcript, updates its live transcript before an immediate continuation,
    and persists and broadcasts terminal tool output emitted for a prior assistant
    message. Continuation response frames are also labelled consistently so
    useAgentChat can apply streamed continuation updates to the active UI state.
    A pending approval-responded tool is no longer mis-reported by the
    incomplete-tool-call backstop, so approval continuations stop logging a false
    "repair gap" warning and emitting a spurious chat:transcript:repaired event.

    The cross-message tool result now flows through StreamAccumulator's
    cross-message-tool-update action and a shared, replay-safe
    crossMessageToolResultUpdate builder (exported from agents/chat): it matches
    terminal states for first-write-wins idempotency against provider replays (e.g.
    the OpenAI Responses API, #1404), preserves a streamed preliminary flag, and
    lets Think skip redundant writes/broadcasts when a result is already settled.

  • #1601 0fb0acf Thanks @threepointone! - Require fixed StreamCallback RPC handlers so sub-agent chat callbacks do not probe missing remote methods.

  • #1641 3aa1936 Thanks @threepointone! - Count a sub-agent's progress as the orchestrating parent's recovery progress

    A parent turn whose work is "run a sub-agent and await its result" produced no
    recoverable content of its own, so under deploy churn the parent's own
    chat-recovery no-progress window could exhaust while the child was still
    healthily streaming — abandoning the turn as interrupted and collecting an
    interrupted result even though the child went on to complete. (Reproduced by
    the examples/deploy-churn --mode subagent harness: the parent exhausted at
    attempt 6/6 with progress: 1 while the child self-healed all 30 steps.)

    Forwarding a child's stream to the parent's connections is now treated as
    genuine forward progress for the parent's recovery budget: Think and
    AIChatAgent advance their durable recovery-progress marker (throttled) each
    time _forwardAgentToolStream forwards child output, so a parent that keeps
    re-attaching to and streaming a live child survives churn indefinitely. The
    credit is only granted when the child actually produces output — a silent or
    hung child still lets the parent exhaust on its own no-progress timer, so a
    stuck sub-agent can never pin a parent's recovery open forever.

    This completes the sub-agent recovery story started by the stable-runId +
    bounded re-attach fix (#1630): the child self-heals and the parent both
    re-attaches to it and keeps its own recovery alive while doing so.

  • #1621 fac4463 Thanks @threepointone! - Settled tool results are now flushed to durable storage immediately during a
    chat turn, so recovery never re-runs an already-completed (often non-idempotent)
    tool call. Stream chunks are batched in memory and flushed to SQLite every ~10
    chunks; the WebSocket chat path did not force a flush on settled tool results,
    so an isolate eviction (deploy) before the next batch flush lost them. Recovery
    then rebuilt the partial assistant message without those tool calls and the
    model re-ran them (e.g. duplicate INSERTs). The sub-agent RPC streaming path
    already flushed recoverable content; this brings the WebSocket path to parity
    via a shared _storeChunkDurably helper that flushes immediately on
    tool-output-available / tool-output-error. Net effect: recovery loses at
    most the single in-flight step, even when multiple evictions hit one turn.

    Also closes two remaining "frozen turn" hydration gaps from the terminal-status
    work: a turn that fails before the stream starts (e.g. a message reconciliation
    error in _handleChatRequest) now records its terminal status, and a recovery
    skip caused by onChatRecovery returning { continue: false } now surfaces a
    terminal error too. Both were previously broadcast (or silent) but not persisted,
    so a client disconnected at that moment stayed frozen on reconnect. Benign skips
    such as conversation_changed (a newer turn already owns the UI) remain silent.
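
The flush policy described in this entry can be modeled as a small buffer. This is a sketch under stated assumptions: the ChunkBuffer class is hypothetical, an in-memory array stands in for SQLite, and the batch size of 10 is the approximate figure from the notes.

```typescript
// Minimal chunk shape for this sketch.
interface Chunk {
  type: string;
  payload?: unknown;
}

const FLUSH_EVERY = 10; // approximate batch size from the notes
const SETTLED_CHUNK_TYPES = new Set(["tool-output-available", "tool-output-error"]);

class ChunkBuffer {
  private pending: Chunk[] = [];
  // Stand-in for durable storage.
  readonly durable: Chunk[] = [];

  // Settled tool results force an immediate flush; other chunks are batched.
  store(chunk: Chunk): void {
    this.pending.push(chunk);
    if (SETTLED_CHUNK_TYPES.has(chunk.type) || this.pending.length >= FLUSH_EVERY) {
      this.flush();
    }
  }

  flush(): void {
    this.durable.push(...this.pending);
    this.pending = [];
  }
}

const buffer = new ChunkBuffer();
buffer.store({ type: "text-delta" });
buffer.store({ type: "text-delta" });
buffer.store({ type: "text-delta" });
const durableBeforeToolResult = buffer.durable.length;
buffer.store({ type: "tool-output-available" });
const durableAfterToolResult = buffer.durable.length;
```

An eviction after the tool result would lose nothing in this model, whereas an eviction before it would lose only the three unflushed text chunks.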

  • #1623 4c8b371 Thanks @threepointone! - Transcript repair now preserves an interrupted/abandoned tool call as an errored result instead of deleting it.

    Previously, a tool call with no recorded output (e.g. a tool interrupted mid-execution by a deploy, or an ask_user answered by the user's next message) was removed from the durable transcript before the next turn. That made the call visibly "disappear" from the broadcast transcript and let the model silently re-run it (duplicating non-idempotent side effects).

    It is now flipped to state: "output-error" with an explanatory message, so:

    • the user-visible record survives (no disappearing tool calls),
    • the model sees the tool errored rather than re-running it blind, and
    • the provider still receives a valid tool-result (no AI_MissingToolResultsError).

    Malformed tool inputs are normalized in the same pass: a stringified-JSON input is parsed back into an object, and a missing/null input on a settled or interrupted tool call is defaulted to {} (Anthropic rejects a tool_use block whose input is absent).

    As a last-line backstop, convertToModelMessages is now called with ignoreIncompleteToolCalls: true, so any incomplete tool call that still slips past the repair (compaction edges, addToolOutput races, unrecognized part shapes) is dropped at conversion rather than 400ing the provider.

    Repair recognizes all of the AI SDK's settled terminal tool states — output-available, output-error, and output-denied (a user-denied approval) — via a single shared predicate, so a tool call that already has a provider-acceptable result is never re-flipped into a generic errored result. Previously output-error was re-flipped on every turn (clobbering a real errorText with the generic "interrupted" message and emitting spurious chat:transcript:repaired events/writes/broadcasts for the life of the conversation), and output-denied was converted into an errored result that lost the denial. A denied tool result is also now flushed to durable storage immediately (like other settled results) so it survives an eviction.
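
A simplified model of the repair pass, combining the settled-state predicate with input normalization. The ToolPart shape, the helper names, and the error text are assumptions; the three settled state names are taken from the entry above.

```typescript
// Settled terminal tool states listed in the notes.
const SETTLED_STATES = new Set(["output-available", "output-error", "output-denied"]);

// Minimal tool-part shape for this sketch.
interface ToolPart {
  toolCallId: string;
  state: string;
  input?: unknown;
  errorText?: string;
}

// Missing or null input becomes {}; stringified JSON objects are parsed back.
function normalizeInput(input: unknown): unknown {
  if (input === undefined || input === null) return {};
  if (typeof input === "string") {
    try {
      const parsed: unknown = JSON.parse(input);
      return typeof parsed === "object" && parsed !== null ? parsed : input;
    } catch {
      return input; // not JSON: leave as-is
    }
  }
  return input;
}

// Settled parts keep their state and errorText; anything else is flipped to
// an errored result rather than deleted.
function repairToolPart(part: ToolPart): ToolPart {
  const input = normalizeInput(part.input);
  if (SETTLED_STATES.has(part.state)) return { ...part, input };
  return {
    ...part,
    input,
    state: "output-error",
    errorText: "Tool call was interrupted before it produced a result.",
  };
}

const interrupted = repairToolPart({ toolCallId: "a", state: "input-available", input: '{"id":7}' });
const alreadyErrored = repairToolPart({ toolCallId: "b", state: "output-error", errorText: "disk full" });
const denied = repairToolPart({ toolCallId: "c", state: "output-denied", input: null });
```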

  • #1646 a245a4a Thanks @threepointone! - Terminalize a chat-recovery turn through onExhausted when it gives up waiting for stable state

    Under extreme churn (a long turn interrupted many times in quick succession), a
    recovery callback (_chatRecoveryRetry / _chatRecoveryContinue) could keep
    timing out waiting for the isolate to reach stable state until its retry budget
    drained. The give-up path only marked the incident failed and completed the
    recovered submission as error — it bypassed _exhaustChatRecovery, so
    onExhausted never fired, the chat:recovery:exhausted event was not emitted,
    the configured terminalMessage banner was never delivered, and the terminal
    chat status was not recorded. Apps relying on onExhausted for the terminal
    banner saw an eternal spinner with no terminal signal.

    The stable-state-timeout give-up now routes through the same
    _exhaustChatRecovery path as deploy-recovery and stall exhaustion: it fires
    onExhausted (with reason: "stable_timeout"), emits chat:recovery:exhausted,
    marks the durable submission interrupted, records the terminal chat status, and
    delivers the terminalMessage. As an extra backstop against silent drops, the
    give-up also terminalizes when the incident record is missing (no incidentId,
    or it was swept/deleted before a stale alarm fired) by synthesizing a terminal
    incident from the recovery-root request id — so a turn can never be dropped with
    no terminal UX.

  • #1623 4c8b371 Thanks @threepointone! - Fix chat recovery falsely marking a durable submission as error under repeated mid-turn deploys.

    When several deploys interrupt a single turn, recovery runs a chain of continuations. Three bugs combined to leave the submission in error even when the turn actually completed every step:

    • Lost ownership. The submission link (recoveredRequestId) was derived from each continuation's own (fresh) requestId, so chained continuations dropped it — the continuation that finally completed the turn could no longer mark the submission completed.
    • Stale-continuation clobber. A superseded continuation tripped the conversation_changed guard because the leaf had advanced via recovery's own forward progress (a new assistant message), not a new user turn, and overwrote the still-running submission to error.
    • Premature stable_timeout. A timeout while waiting for the isolate to settle (common while a deploy is in flight) failed the turn terminally at the first attempt.

    Now: submission ownership is keyed off the stable recovery root and threaded through the entire continuation chain (including the terminal abandon paths — recovery exhaustion and { continue: false } — which previously marked the submission by the per-continuation requestId and so left a chained submission stuck running); a superseded continuation skips benignly (only a genuinely newer user turn marks the submission skipped, never error); and a stable-state timeout reschedules within the maxAttempts budget. A turn that completes under deploy churn now ends completed, not error.

    @cloudflare/ai-chat has the same recovery machinery but no durable-submission layer, so it receives the stable_timeout reschedule fix only: a transient stable-state timeout now retries within the attempt budget instead of permanently abandoning a recoverable turn at the first attempt.

  • #1619 6d1a8f9 Thanks @threepointone! - Interrupted/failed chat turns are no longer silently "frozen" for clients that
    reconnect after the failure. The terminal MSG_CHAT_RESPONSE broadcast (on a
    turn error or exhausted recovery) is transient — a client disconnected at that
    moment (e.g. during a deploy / WebSocket reconnect storm) misses it, and on
    reconnect onConnect previously replayed only the current messages with no
    terminal signal, so the turn appeared stuck with no completed response and no
    error. Think now persists a durable record of the last terminal turn and
    replays it on connect, so a reconnecting client learns the turn failed. The
    record is cleared when a later turn completes; benign recovery skips (e.g.
    conversation_changed, where a newer turn owns the UI) are intentionally not
    surfaced.

  • #1640 edb126a Thanks @threepointone! - Re-attach to a still-running sub-agent (agentTool()) run on parent recovery instead of abandoning and re-running it (#1630).

    When a parent agent was interrupted (deploy / Durable Object eviction) while a child agentTool() run was still in flight, recovery marked the run interrupted within a ~5s window and the parent re-issued the task — re-running the child's already-completed work. For long-running children under continuous deploys this surfaced to users as "the agent went all the way back and lost the files it already wrote."

    Three changes fix this:

    • Stable child runId. agentTool() now derives the child runId from the (recovery-preserved) tool call id (agent-tool:<toolCallId>) instead of minting a fresh nanoid per call. A turn re-run by chat recovery now resolves to the same idempotent child facet rather than spawning a brand-new one, so completed child work is never re-run.
    • Bounded re-attach. A duplicate non-terminal runId (in runAgentTool) and a still-running child during startup reconciliation now tail the live child to its real terminal result and collect it, instead of immediately sealing interrupted. Re-attach is bounded by a generous wall-clock budget (DEFAULT_AGENT_TOOL_REATTACH_TIMEOUT_MS, 120s, internal): a child that keeps advancing toward terminal within the window is collected; a genuinely hung child still seals interrupted so recovery can never block forever.
    • Durable child-run reconcile. A child facet self-heals its interrupted turn via its own chatRecovery, but that recovery path never wrote the child's agent-tool run row. After a real eviction the row was therefore stranded as running (@cloudflare/think) or force-errored (@cloudflare/ai-chat), and the parent could never collect the recovered result. Both @cloudflare/think and @cloudflare/ai-chat now reconcile a stale child-run row from the durable transcript on inspect: while recovery is still resolving the row stays running; once it settles, a completed assistant response surfaces as completed (so the parent collects the real result) and an empty/failed recovery as error. This keeps the child's own (working) recovery path untouched.

    No new public configuration. Adds an internal agent_tool:recovery:reattach observability event. @cloudflare/think and @cloudflare/ai-chat child tails are now read-only on consumer detach (a parent's re-attach budget expiring never cancels the still-running child).
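
Why a stable runId prevents re-running child work can be shown with a toy registry. Only the agent-tool:<toolCallId> format comes from the notes; the ChildRun type, RunRegistry class, and startOrAttach method are hypothetical.

```typescript
// Derives the child run id from the recovery-preserved tool call id.
function childRunId(toolCallId: string): string {
  return `agent-tool:${toolCallId}`;
}

interface ChildRun {
  runId: string;
  status: "running" | "completed";
}

// Hypothetical registry: a duplicate runId attaches to the existing run
// instead of spawning a new one.
class RunRegistry {
  private runs = new Map<string, ChildRun>();

  startOrAttach(toolCallId: string): { run: ChildRun; attached: boolean } {
    const runId = childRunId(toolCallId);
    const existing = this.runs.get(runId);
    if (existing) return { run: existing, attached: true };
    const run: ChildRun = { runId, status: "running" };
    this.runs.set(runId, run);
    return { run, attached: false };
  }
}

const registry = new RunRegistry();
const original = registry.startOrAttach("call-42");
// The parent turn is re-run by recovery with the same tool call id.
const afterRecovery = registry.startOrAttach("call-42");
```

With a fresh random id per call, the second lookup would miss and the child's work would start over.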
