Minor Changes
-
#1636
f5a0d00 Thanks @threepointone! - Expose recovery incident identity and enrich the onExhausted payload so
products can build a terminal-state policy without re-deriving anything (#1631).
ChatRecoveryContext (the onChatRecovery argument) now includes
recoveryRootRequestId, the stable request ID for the whole continuation
chain. Unlike requestId, it doesn't change across chained continuations, so
it's the right key for per-incident budget tracking / fresh-incident detection
without re-deriving identity from message IDs.
ChatRecoveryExhaustedContext (the onExhausted argument) now carries
recoveryRootRequestId, terminalMessage (the exact text shown to the user),
partialText/partialParts (what the turn produced before it was given up
on), and streamId/createdAt: enough to render or persist a user-facing
terminal banner AND emit correlated terminal telemetry (e.g. time-since-turn-start,
stream correlation) directly, without re-deriving anything.
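As a rough illustration of how the enriched payload might be consumed, the sketch below turns an exhausted-recovery context into a single record usable for both a banner and a telemetry event. The field names come from the entry above; the `ExhaustedContextSketch` interface, its field types (including `createdAt` as epoch milliseconds), and `buildTerminalRecord` are assumptions made for this example, not the package's exported types.

```typescript
// Hypothetical local stand-in for ChatRecoveryExhaustedContext. Field names
// follow the changelog entry; the types are assumptions.
interface ExhaustedContextSketch {
  recoveryRootRequestId: string;
  terminalMessage: string;
  partialText: string;
  partialParts: unknown[];
  streamId: string;
  createdAt: number; // assumed: epoch milliseconds when the turn started
}

interface TerminalRecord {
  incidentKey: string;
  banner: string;
  hadPartial: boolean;
  streamId: string;
  msSinceTurnStart: number;
}

// Builds one record that can drive both a user-facing banner and telemetry.
function buildTerminalRecord(
  ctx: ExhaustedContextSketch,
  now: number
): TerminalRecord {
  return {
    incidentKey: ctx.recoveryRootRequestId, // stable across chained continuations
    banner: ctx.terminalMessage,
    hadPartial: ctx.partialText.length > 0 || ctx.partialParts.length > 0,
    streamId: ctx.streamId,
    msSinceTurnStart: now - ctx.createdAt,
  };
}

const terminalRecord = buildTerminalRecord(
  {
    recoveryRootRequestId: "req-root-1",
    terminalMessage: "This turn could not be completed.",
    partialText: "Step 1 done.",
    partialParts: [],
    streamId: "stream-1",
    createdAt: 1_000,
  },
  61_000
);
console.log(terminalRecord);
```

In a real agent this logic would presumably live inside the onExhausted hook, with the record written to storage or an analytics sink.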
All fields are additive. Applied across agents (shared types),
@cloudflare/think, and @cloudflare/ai-chat.
-
#1584
87006e2 Thanks @threepointone! - Add a framework-agnostic Agent Skills engine at agents/skills: skill sources (fromManifest, R2), a SkillRegistry that produces a catalog prompt and AI SDK activation tools (activate_skill, read_skill_resource, run_skill_script), binary-safe resource reads, and qualified cross-skill resource paths. Bundled skills are imported through the Agents Vite plugin with the agents:skills specifier (defaulting to a ./skills directory), typed via ambient declarations shipped from agents. @cloudflare/think re-exports the engine as skills and wires getSkills() into the turn; any AI SDK caller (including @cloudflare/ai-chat) can build a SkillRegistry directly.
Skill loading is resilient: duplicate or failing sources are skipped with a warning (first source wins) instead of throwing. Optional, experimental script execution (skills.runner) runs function-style JavaScript/TypeScript (export default run(input, ctx) with ctx = { skill, files, workspace, tools, output }) plus path-based Python and Bash, all behind a single capability and permission bridge.
-
#1648
d6827ab Thanks @threepointone! - Surface a live "recovering…" status to chat clients during durable recovery (#1620)
When a durable chat turn is interrupted (a deploy/eviction, or a stream-stall
watchdog abort) and resumes, clients had no "in progress" signal: the turn
looked frozen until it completed or a terminal error was replayed. A new
cf_agent_chat_recovering protocol frame is now broadcast on recovery schedule
and cleared on every terminal outcome (completed/skipped/failed/exhausted), so
the indicator can't spin forever. In @cloudflare/think it's also persisted and
replayed on connect, so a client that joins mid-recovery learns the turn is
working. useAgentChat exposes a new isRecovering flag (distinct from
isStreaming, because a recovering turn isn't producing tokens yet); most UIs render
isStreaming || isRecovering as "busy". Backward-compatible: clients that don't
understand the frame ignore it.
Note: @cloudflare/ai-chat broadcasts the live signal but does not yet replay
it on connect (it has no idle-connect hydration path; tracked in #1645).
@cloudflare/think has both.
For recovery telemetry, subscribe to the chat:recovery:* observability events
and route them to your analytics sink.
-
#1611
02f9380 Thanks @threepointone! - Add bounded, observable recovery foundations for durable chat turns and fibers.
- Add dedicated recovery observability channels/events for fibers, chat recovery, transcript repair, and agent-tool recovery.
- Bound internal framework fiber recovery hooks and parent agent-tool recovery scans so startup and recovery work cannot wedge indefinitely.
- Add shared chat recovery incident tracking with attempt counts, configurable chatRecovery defaults, and terminal exhaustion behavior for AIChatAgent and Think. Think recovery now exhausts after six failed attempts by default and sends a terminal error frame instead of spinning indefinitely.
- Keep the recovery attempt budget bounded even when an interrupted turn flips between retry and continue recovery kinds (the incident identity no longer includes the kind), guard a throwing onExhausted hook so the terminal UX is still delivered, mark incidents failed when the recovery dispatch throws, and reclaim incident records on success plus a TTL sweep for abandoned ones so durable storage does not grow without bound.
- Bound generic unmanaged fiber recovery with a configurable fiberRecoveryMaxAgeMs so a repeatedly-throwing onFiberRecovered() hook cannot re-trigger forever across restarts.
- Surface Think post-persist chat request failures through onChatError(error, ctx) and chat:request:failed.
- Repair incomplete Think tool-call transcripts before provider calls and allow createCompactFunction() to use a supplied token counter for tail budgeting.
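The incident-tracking bullets above (attempt counts, kind-independent identity, reclaim on success, TTL sweep) can be sketched roughly as follows. This is an in-memory illustration only: `IncidentRecord`, `recordAttempt`, `completeIncident`, and `sweepIncidents` are hypothetical names, and the real framework keeps this state in durable storage with a shape the changelog does not show.

```typescript
// Hypothetical incident record, invented for this sketch.
interface IncidentRecord {
  attempts: number;
  lastSeenAt: number; // epoch milliseconds
}

type RecoveryKind = "retry" | "continue";

const incidents = new Map<string, IncidentRecord>();

// Records one recovery attempt. The key is the incident id alone: the recovery
// kind is deliberately not part of the identity, so flipping between "retry"
// and "continue" cannot reset the count.
function recordAttempt(
  incidentId: string,
  _kind: RecoveryKind,
  now: number,
  maxAttempts: number
): "proceed" | "exhausted" {
  const record = incidents.get(incidentId) ?? { attempts: 0, lastSeenAt: now };
  if (record.attempts >= maxAttempts) return "exhausted";
  record.attempts += 1;
  record.lastSeenAt = now;
  incidents.set(incidentId, record);
  return "proceed";
}

// Reclaims the record when the turn recovers successfully.
function completeIncident(incidentId: string): void {
  incidents.delete(incidentId);
}

// Removes abandoned records older than ttlMs and returns how many were swept.
function sweepIncidents(now: number, ttlMs: number): number {
  let swept = 0;
  incidents.forEach((record, id) => {
    if (now - record.lastSeenAt > ttlMs) {
      incidents.delete(id);
      swept += 1;
    }
  });
  return swept;
}
```

With `maxAttempts` set to 6 (the default this entry describes for Think), the seventh call for the same incident would return "exhausted" under this sketch's counting convention, which is itself an assumption.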
-
#1638
b6c8dea Thanks @threepointone! - Make chat recovery's budget wall-clock-keyed-to-progress instead of raw attempt
count, so a healthy turn under deploy churn isn't sealed prematurely (#1637).
Under continuous deploys the attempt count is the wrong primary bound: one
rollout drops/reconnects the socket several times (~11–22s), each firing a
recovery alarm, so the count inflated far faster than the real interruption rate
and exhausted turns that were still advancing (0/23 model calls errored in the
reported incident; it was pure eviction churn).
Now:
- Primary bound: a 5-minute no-progress wall clock keyed to lastProgressAt,
  which resets on every progress-bearing attempt. A turn that keeps producing
  content survives churn indefinitely; one that genuinely goes quiet is sealed
  within 5 minutes.
- Alarm debounce (~30s): recovery alarms bunched within the window (a single
  rollout's reconnect storm) collapse into one attempt.
- Attempt cap is now a high secondary backstop (default raised 6 → 10),
  resets on progress; it only catches a pathological tight alarm-loop.
- The existing 15-minute absolute incident-age ceiling is kept as the final
  non-resetting hard stop.
- Progress signal moved to production time (when new content is durably
  flushed/streamed) instead of persist time, so it advances only on genuinely
  new content and is immune to client reconnects and recovery re-persists, which
  the no-progress window depends on. (Builds on the compaction-immune counter
  from #1628.)
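The layered bounds listed above can be read as a single decision made on each recovery alarm. The following is a minimal sketch of that reading, not the framework's code: the constants mirror the numbers in this entry, while `IncidentState`, `onRecoveryAlarm`, `onProgress`, and the order in which the checks run are assumptions.

```typescript
// Hypothetical per-incident state; only lastProgressAt is named in the entry.
interface IncidentState {
  startedAt: number; // epoch ms when the incident began
  lastProgressAt: number; // epoch ms of the last genuinely new content
  lastAttemptAt: number | null;
  attemptsSinceProgress: number;
}

const NO_PROGRESS_WINDOW_MS = 5 * 60_000; // primary bound
const ALARM_DEBOUNCE_MS = 30_000; // collapses a reconnect storm
const MAX_ATTEMPTS = 10; // secondary backstop
const MAX_INCIDENT_AGE_MS = 15 * 60_000; // non-resetting hard stop

type AlarmDecision = "attempt" | "debounced" | "exhausted";

function onRecoveryAlarm(state: IncidentState, now: number): AlarmDecision {
  // Hard stop: never reset by progress.
  if (now - state.startedAt >= MAX_INCIDENT_AGE_MS) return "exhausted";
  // Primary bound: the turn has gone quiet for the whole window.
  if (now - state.lastProgressAt >= NO_PROGRESS_WINDOW_MS) return "exhausted";
  // Alarms bunched inside the debounce window count as one attempt.
  if (
    state.lastAttemptAt !== null &&
    now - state.lastAttemptAt < ALARM_DEBOUNCE_MS
  ) {
    return "debounced";
  }
  // Backstop against a tight alarm loop.
  if (state.attemptsSinceProgress >= MAX_ATTEMPTS) return "exhausted";
  state.attemptsSinceProgress += 1;
  state.lastAttemptAt = now;
  return "attempt";
}

// Called when new content is durably flushed or streamed.
function onProgress(state: IncidentState, now: number): void {
  state.lastProgressAt = now;
  state.attemptsSinceProgress = 0;
}
```

Under these assumptions a turn that calls `onProgress` at least once every five minutes keeps recovering until the 15-minute ceiling, which matches the behavior the entry describes.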
Applies to both @cloudflare/think and @cloudflare/ai-chat, including the
TaskSubAgent/sub-agent recovery path.
-
#1587
32ea71e Thanks @threepointone! - Add first-class Think messengers with provider-neutral routing, durable Chat SDK state, streamed Think replies, action events, and a Telegram provider entrypoint.
The messenger runtime depends directly on Chat SDK, supports provider-specific adapter names for multi-bot setups, and exposes the Telegram provider as both a named and default export.
-
#1643
bc86dce Thanks @threepointone! - Route a stream-stall watchdog abort into bounded recovery instead of a terminal error (#1626)
When chatStreamStallTimeoutMs is set and the inactivity watchdog fires on a
hung model/transport stream, the turn is no longer failed terminally. Because a
stall is just another interruption, like a deploy or eviction, it is now
routed into the same bounded chat-recovery path: the settled partial is
preserved, a continuation is scheduled, and the turn resumes. A transient hang
(the common case under deploy churn) recovers automatically; a persistently
hanging provider still terminalizes once the recovery budget is exhausted (the
watchdog's original "kill the infinite spinner" guarantee, now after bounded
retries). Exhaustion goes through the same _exhaustChatRecovery path as
deploy-recovery exhaustion, so your configured terminalMessage is delivered,
onExhausted fires, and the chat:recovery:exhausted event is emitted, rather
than leaking the raw "Chat stream stalled…" error.
This is automatic whenever the watchdog is enabled and chatRecovery is on
(the Think default), with no new configuration. Idempotency matches deploy
recovery: settled tool results are durable and are not re-run, but a tool that
was mid-execution when the stall fired re-runs on the continuation. With
chatRecovery disabled, a stall stays terminal as before.
Also adds a per-turn TurnConfig.chatStreamStallTimeoutMs override (returned
from beforeTurn): a turn known to invoke a slow tool can raise or disable
(0) the watchdog for that turn only, instead of permanently widening the
instance-level window. It auto-resets after the turn.
-
#1647
0b29be5 Thanks @threepointone! - Add StreamCallback.onInterrupted() so a chat()-driven turn interrupted by recovery isn't silently abandoned
When a turn driven through chat(userMessage, callback) is interrupted and routed
into bounded recovery (a stream-stall watchdog abort), the scheduled continuation
runs in a later isolate invocation without the original callback, so neither
onDone() nor onError() ever fires for that callback. Because the isolate is
still alive, the RPC promise resolves cleanly, and a consumer that keys off the
clean resolve mis-reads it as success: it finalizes whatever partial it had
streamed. For the built-in messenger delivery this meant posting a truncated
answer as final, while the real recovered answer was produced later and broadcast
only to WebSocket connections.
StreamCallback now has an optional onInterrupted?() signal, emitted from the
stall→recovery branches of the RPC stream path instead of returning silently. It
means "not done, not a terminal error; a continuation owns the final outcome";
consumers should keep the channel open / show a recovering state / re-attach
rather than finalizing the partial. It is optional, so existing
StreamCallback implementers are unaffected.
Messenger delivery is wired to it: an interrupted reply now surfaces an
"interrupted, please retry" message instead of finalizing the truncated partial.
Note: a deploy/eviction interruption kills the isolate (and the callback) before
this can fire, so the caller observes a transport break instead. onInterrupted
covers the in-isolate stall→recovery path.
-
#1598
f5e37bf Thanks @threepointone! - Add ThinkWorkflow with durable step.prompt() support for Workflow-owned Think reasoning steps.
-
#1635
5995fa8 Thanks @threepointone! - Add a repairInterruptedToolPart hook so subclasses can control how an
interrupted tool call is repaired during transcript repair (#1631).
Transcript repair flips a tool call with no settled result to an errored
tool-result (preserving the record and keeping the provider from 400ing). That
is the right default for server tools, but wrong for client-resolved tools like
ask_user (a question with no server execute, answered by the user's next
message), where the interrupted call is a question and should be preserved as
text so the model sees normal Q→A conversation and compaction keeps the prompt
verbatim. Because repair runs (and persists) before beforeTurn, a subclass had
no way to shape this for the current turn.
repairInterruptedToolPart(part) defaults to the existing errored-result
behavior and runs during repair, so an override (e.g. converting an interrupted
ask_user into a text part carrying the prompt) takes effect on the same turn,
not just the next one.
-
#1623
4c8b371 Thanks @threepointone! - Add an opt-in inactivity watchdog for the streaming read loop, so a hung provider/transport surfaces a terminal error instead of an infinite spinner.
Previously, if a model stream parked without ever throwing (no chunk, no error, no done), the chat read loop would wait forever and the client would spin indefinitely. There was no detection for a silently hung turn (only recovery-path stable_timeout, which guards recovery scheduling, not a live stream).
Set chatStreamStallTimeoutMs on a Think subclass to arm it: if no UI-message-stream chunk arrives within that window, the watchdog aborts the turn and the loop exits with a terminal stream error (routed through onChatError with stage: "stream"), emitting a new chat:stream:stalled observability event.
It is off by default (0) and applies to both the WebSocket turn loop and the chat() / sub-agent callback loop. Note that it measures the gap between stream chunks, which includes server-side tool execution time (no chunks flow while a tool runs), so set it comfortably above your slowest model time-to-first-token and slowest tool, or you will abort healthy long turns. A good starting point is 120_000.
-
#1585
8ad724b Thanks @threepointone! - Add declarative scheduled tasks for Think agents with a typed recurring DSL, timezone-aware reconciliation, and durable idempotent submissions.
Patch Changes
-
#1634
a4225fd Thanks @threepointone! - Stop chat recovery from discarding settled work when a turn is given up on
(#1631).
Two paths could throw away a partial assistant message containing completed,
often non-idempotent tool results:
- When the framework's own recovery budget was exhausted, _exhaustChatRecovery
  sealed the turn (terminal status + banner) before the orphaned stream was
  ever persisted, so every settled tool result the turn had produced was lost
  and the model re-ran them on the next message. Exhaustion now persists the
  settled partial first, using the same gating as the normal recovery path so it
  can't duplicate an already-saved partial.
- A subclass onChatRecovery returning { persist: false } to stop a turn used
  to silently drop the settled partial. Settled work is now never dropped:
  persist: false only suppresses persistence of a partial that has nothing
  settled to lose; a partial carrying settled tool results is persisted
  regardless. An app can no longer accidentally discard completed work, and it
  never needs { persist: true } just to stay safe. (A safe default beats a
  warning about an unsafe one.)
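The second rule above amounts to a small predicate: settled tool results force persistence, and persist: false only applies when nothing settled would be lost. The sketch below shows one way to express it. The `PartSketch` shape, the `tool-` type prefix, and the default of persisting when the hook returns no preference are assumptions; the three settled state names are the ones listed elsewhere in this changelog.

```typescript
// Hypothetical minimal message-part shape for this example.
interface PartSketch {
  type: string; // assumed: tool parts use a "tool-<name>" type
  state?: string;
}

// Settled terminal tool states named elsewhere in this changelog.
const SETTLED_STATES = ["output-available", "output-error", "output-denied"];

function hasSettledToolResult(parts: PartSketch[]): boolean {
  return parts.some(
    (part) =>
      part.type.startsWith("tool-") &&
      part.state !== undefined &&
      SETTLED_STATES.indexOf(part.state) !== -1
  );
}

// Decides whether an interrupted partial should be written to storage.
function shouldPersistPartial(
  parts: PartSketch[],
  hookResult: { persist?: boolean }
): boolean {
  // Settled work is never dropped, whatever the hook asked for.
  if (hasSettledToolResult(parts)) return true;
  // Otherwise honor an explicit opt-out (assumed default: persist).
  return hookResult.persist !== false;
}
```

For example, a partial containing a completed `tool-insertRow` part would be persisted even if the hook returned `{ persist: false }`, while a text-only partial would be skipped.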
Applied identically to @cloudflare/think and @cloudflare/ai-chat.
-
#1633
1aca578 Thanks @threepointone! - Fix chat recovery prematurely exhausting its retry budget under compaction
(#1628). The deploy-churn forward-progress signal — which resets the recovery
budget when an interrupted turn is actually advancing — was recomputed from the
live transcript by counting assistant messages. Compaction collapses older
assistant messages into a summary, lowering that count, so a turn that had
genuinely advanced could read as "no progress" between recovery attempts and
exhaust at maxAttempts, sealing a healthy turn. Progress is now tracked by a
durable, monotonic counter incremented when _persistOrphanedStream materializes
a non-empty partial (the exact event the message count was proxying for), so
compaction can never lower it. A turn that genuinely fails to advance still
exhausts at the cap, and the 15-minute wall-clock ceiling is unchanged.
-
#1615
51a771f Thanks @threepointone! - Chat recovery no longer permanently abandons a turn under repeated deploys. A
mid-turn deploy resets the Durable Object ("code was updated") and the
interrupted continuation is re-detected on the next wake; previously every such
interruption consumed one of the bounded recovery attempts, so a deploy every
few minutes exhausted the budget (max_attempts_exceeded) and the turn was
terminally abandoned even though each fresh isolate was healthy. Recovery now
distinguishes an interruption that followed forward progress (more persisted
assistant content than the previous attempt observed) — treated as environmental
and not counted against the budget — from a turn that never advances, which still
exhausts at maxAttempts. A 15-minute wall-clock ceiling per incident bounds the
worst case so a continuously churning environment cannot retry forever.
-
#1618
e6b6c0b Thanks @threepointone! - Chat continuation no longer fails on models that reject assistant-prefill.
Continuing a partial assistant turn (e.g. after a deploy interrupts a stream)
replayed a transcript whose final message was that partial assistant message.
Modern chat models reject a request ending in an assistant message — Anthropic
Claude 4.6+ returns a 400 ("This model does not support assistant message
prefill. The conversation must end with a user message.") — so the continuation
threw and the turn was left interrupted. Think now appends an ephemeral user
"continue" checkpoint whenever a model request would otherwise end in an
assistant message, so continuation works across providers. The checkpoint
shapes only the model request and is never persisted to the transcript.
-
#1608
7c17736 Thanks @cjol! - Fix auto-continuation stream resumes so immediate client-tool resume requests attach to the pending continuation instead of receiving cf_agent_stream_resume_none.
-
#1651
d118d11 Thanks @threepointone! - Fix auto-continuation firing before all parallel client-tool results arrive
(#1649). When the model emitted multiple tool calls in one step and the client
resolved them independently via addToolOutput, a fast result's autoContinue
could trigger the next inference while a slower sibling was still
input-available. That fed the provider an incomplete tool-result set
(MissingToolResultsError) or, via the transcript-repair backstop, silently
flipped the in-flight sibling to errored and ran a spurious extra continuation.
Auto-continuation now waits until the transcript is stable (no
input-available/approval-requested parts) before continuing, so a fanned-out
tool batch coalesces into a single continuation regardless of result arrival
order. The wait is bounded, so a genuinely orphaned tool call (e.g. the client
disconnected mid-batch) still falls through to the existing backstop instead of
pinning the continuation open.
-
#1629
7d38363 Thanks @whoiskatrin! - Fix server-side needsApproval tool continuations remaining stuck after the
user approves them. Think now keeps approved/denied/errored tool parts in the
model transcript, updates its live transcript before an immediate continuation,
and persists and broadcasts terminal tool output emitted for a prior assistant
message. Continuation response frames are also labelled consistently so
useAgentChat can apply streamed continuation updates to the active UI state.
A pending approval-responded tool is no longer mis-reported by the
incomplete-tool-call backstop, so approval continuations stop logging a false
"repair gap" warning and emitting a spurious chat:transcript:repaired event.
The cross-message tool result now flows through StreamAccumulator's
cross-message-tool-update action and a shared, replay-safe
crossMessageToolResultUpdate builder (exported from agents/chat): it matches
terminal states for first-write-wins idempotency against provider replays (e.g.
the OpenAI Responses API, #1404), preserves a streamed preliminary flag, and
lets Think skip redundant writes/broadcasts when a result is already settled.
-
#1601
0fb0acf Thanks @threepointone! - Require fixed StreamCallback RPC handlers so sub-agent chat callbacks do not probe missing remote methods.
-
#1641
3aa1936 Thanks @threepointone! - Count a sub-agent's progress as the orchestrating parent's recovery progress
A parent turn whose work is "run a sub-agent and await its result" produced no
recoverable content of its own, so under deploy churn the parent's own
chat-recovery no-progress window could exhaust while the child was still
healthily streaming, abandoning the turn as interrupted and collecting an
interrupted result even though the child went on to complete. (Reproduced by
the examples/deploy-churn --mode subagent harness: the parent exhausted at
attempt 6/6 with progress: 1 while the child self-healed all 30 steps.)
Forwarding a child's stream to the parent's connections is now treated as
genuine forward progress for the parent's recovery budget: Think and
AIChatAgent advance their durable recovery-progress marker (throttled) each
time _forwardAgentToolStream forwards child output, so a parent that keeps
re-attaching to and streaming a live child survives churn indefinitely. The
credit is only granted when the child actually produces output: a silent or
hung child still lets the parent exhaust on its own no-progress timer, so a
stuck sub-agent can never pin a parent's recovery open forever.
This completes the sub-agent recovery story started by the stable-runId +
bounded re-attach fix (#1630): the child self-heals and the parent both
re-attaches to it and keeps its own recovery alive while doing so.
-
#1621
fac4463 Thanks @threepointone! - Settled tool results are now flushed to durable storage immediately during a
chat turn, so recovery never re-runs an already-completed (often non-idempotent)
tool call. Stream chunks are batched in memory and flushed to SQLite every ~10
chunks; the WebSocket chat path did not force a flush on settled tool results,
so an isolate eviction (deploy) before the next batch flush lost them. Recovery
then rebuilt the partial assistant message without those tool calls and the
model re-ran them (e.g. duplicate INSERTs). The sub-agent RPC streaming path
already flushed recoverable content; this brings the WebSocket path to parity
via a shared _storeChunkDurably helper that flushes immediately on
tool-output-available/tool-output-error. Net effect: recovery loses at
most the single in-flight step, even when multiple evictions hit one turn.
Also closes two remaining "frozen turn" hydration gaps from the terminal-status
work: a turn that fails before the stream starts (e.g. a message reconciliation
error in _handleChatRequest) now records its terminal status, and a recovery
skip caused by onChatRecovery returning { continue: false } now surfaces a
terminal error too. Both were previously broadcast (or silent) but not persisted,
so a client disconnected at that moment stayed frozen on reconnect. Benign skips
such as conversation_changed (a newer turn already owns the UI) remain silent.
-
#1623
4c8b371 Thanks @threepointone! - Transcript repair now preserves an interrupted/abandoned tool call as an errored result instead of deleting it.
Previously, a tool call with no recorded output (e.g. a tool interrupted mid-execution by a deploy, or an ask_user answered by the user's next message) was removed from the durable transcript before the next turn. That made the call visibly "disappear" from the broadcast transcript and let the model silently re-run it (duplicating non-idempotent side effects).
It is now flipped to state: "output-error" with an explanatory message, so:
- the user-visible record survives (no disappearing tool calls),
- the model sees the tool errored rather than re-running it blind, and
- the provider still receives a valid tool-result (no AI_MissingToolResultsError).
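A minimal sketch of the flip described above, under stated assumptions: the `ToolPartSketch` shape, the choice of `input-streaming` and `input-available` as the "no recorded output" states, and the error text are all invented for illustration. Approval-related states are deliberately left untouched here.

```typescript
// Hypothetical tool-part shape for this example.
interface ToolPartSketch {
  type: string;
  toolCallId: string;
  state: string;
  errorText?: string;
}

// Assumed: states that mean the call never produced a recorded output.
const UNSETTLED_STATES = ["input-streaming", "input-available"];

// Hypothetical explanatory message; the framework's real wording is not shown.
const INTERRUPTED_TEXT = "Tool call was interrupted before it produced a result.";

// Keeps every part in place; unsettled tool calls become errored results
// instead of being removed from the transcript.
function repairInterrupted(parts: ToolPartSketch[]): ToolPartSketch[] {
  return parts.map((part) =>
    UNSETTLED_STATES.indexOf(part.state) !== -1
      ? { ...part, state: "output-error", errorText: INTERRUPTED_TEXT }
      : part
  );
}

const repairedParts = repairInterrupted([
  { type: "tool-writeFile", toolCallId: "call-1", state: "input-available" },
  { type: "tool-readFile", toolCallId: "call-2", state: "output-available" },
]);
console.log(repairedParts.map((part) => `${part.toolCallId}: ${part.state}`));
```

The point of mapping rather than filtering is that the transcript keeps the same number of parts, so the user-visible record and the provider's tool-call/tool-result pairing both survive.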
Malformed tool inputs are normalized in the same pass: a stringified-JSON input is parsed back into an object, and a missing/null input on a settled or interrupted tool call is defaulted to {} (Anthropic rejects a tool_use block whose input is absent).
As a last-line backstop, convertToModelMessages is now called with ignoreIncompleteToolCalls: true, so any incomplete tool call that still slips past the repair (compaction edges, addToolOutput races, unrecognized part shapes) is dropped at conversion rather than 400ing the provider.
Repair recognizes all of the AI SDK's settled terminal tool states, namely output-available, output-error, and output-denied (a user-denied approval), via a single shared predicate, so a tool call that already has a provider-acceptable result is never re-flipped into a generic errored result. Previously output-error was re-flipped on every turn (clobbering a real errorText with the generic "interrupted" message and emitting spurious chat:transcript:repaired events/writes/broadcasts for the life of the conversation), and output-denied was converted into an errored result that lost the denial. A denied tool result is also now flushed to durable storage immediately (like other settled results) so it survives an eviction.
-
#1646
a245a4a Thanks @threepointone! - Terminalize a chat-recovery turn through onExhausted when it gives up waiting for stable state
Under extreme churn (a long turn interrupted many times in quick succession), a
recovery callback (_chatRecoveryRetry/_chatRecoveryContinue) could keep
timing out waiting for the isolate to reach stable state until its retry budget
drained. The give-up path only marked the incident failed and completed the
recovered submission as error. It bypassed _exhaustChatRecovery, so
onExhausted never fired, the chat:recovery:exhausted event was not emitted,
the configured terminalMessage banner was never delivered, and the terminal
chat status was not recorded. Apps relying on onExhausted for the terminal
banner saw an eternal spinner with no terminal signal.
The stable-state-timeout give-up now routes through the same
_exhaustChatRecovery path as deploy-recovery and stall exhaustion: it fires
onExhausted (with reason: "stable_timeout"), emits chat:recovery:exhausted,
marks the durable submission interrupted, records the terminal chat status, and
delivers the terminalMessage. As an extra backstop against silent drops, the
give-up also terminalizes when the incident record is missing (no incidentId,
or it was swept/deleted before a stale alarm fired) by synthesizing a terminal
incident from the recovery-root request id, so a turn can never be dropped with
no terminal UX.
-
#1623
4c8b371 Thanks @threepointone! - Fix chat recovery falsely marking a durable submission as error under repeated mid-turn deploys.
When several deploys interrupt a single turn, recovery runs a chain of continuations. Three bugs combined to leave the submission in error even when the turn actually completed every step:
- Lost ownership. The submission link (recoveredRequestId) was derived from each continuation's own (fresh) requestId, so chained continuations dropped it: the continuation that finally completed the turn could no longer mark the submission completed.
- Stale-continuation clobber. A superseded continuation tripped the conversation_changed guard because the leaf had advanced via recovery's own forward progress (a new assistant message), not a new user turn, and overwrote the still-running submission to error.
- Premature stable_timeout. A timeout while waiting for the isolate to settle (common while a deploy is in flight) failed the turn terminally at the first attempt.
Now: submission ownership is keyed off the stable recovery root and threaded through the entire continuation chain (including the terminal abandon paths, recovery exhaustion and { continue: false }, which previously marked the submission by the per-continuation requestId and so left a chained submission stuck running); a superseded continuation skips benignly (only a genuinely newer user turn marks the submission skipped, never error); and a stable-state timeout reschedules within the maxAttempts budget. A turn that completes under deploy churn now ends completed, not error.
@cloudflare/ai-chat has the same recovery machinery but no durable-submission layer, so it receives the stable_timeout reschedule fix only: a transient stable-state timeout now retries within the attempt budget instead of permanently abandoning a recoverable turn at the first attempt.
-
#1619
6d1a8f9 Thanks @threepointone! - Interrupted/failed chat turns are no longer silently "frozen" for clients that
reconnect after the failure. The terminal MSG_CHAT_RESPONSE broadcast (on a
turn error or exhausted recovery) is transient — a client disconnected at that
moment (e.g. during a deploy / WebSocket reconnect storm) misses it, and on
reconnect onConnect previously replayed only the current messages with no
terminal signal, so the turn appeared stuck with no completed response and no
error. Think now persists a durable record of the last terminal turn and
replays it on connect, so a reconnecting client learns the turn failed. The
record is cleared when a later turn completes; benign recovery skips (e.g.
conversation_changed, where a newer turn owns the UI) are intentionally not
surfaced.
-
#1640
edb126a Thanks @threepointone! - Re-attach to a still-running sub-agent (agentTool()) run on parent recovery instead of abandoning and re-running it (#1630).
When a parent agent was interrupted (deploy / Durable Object eviction) while a child agentTool() run was still in flight, recovery marked the run interrupted within a ~5s window and the parent re-issued the task, re-running the child's already-completed work. For long-running children under continuous deploys this surfaced to users as "the agent went all the way back and lost the files it already wrote."
Three changes fix this:
- Stable child runId. agentTool() now derives the child runId from the (recovery-preserved) tool call id (agent-tool:<toolCallId>) instead of minting a fresh nanoid per call. A turn re-run by chat recovery now resolves to the same idempotent child facet rather than spawning a brand-new one, so completed child work is never re-run.
- Bounded re-attach. A duplicate non-terminal runId (in runAgentTool) and a still-running child during startup reconciliation now tail the live child to its real terminal result and collect it, instead of immediately sealing interrupted. Re-attach is bounded by a generous wall-clock budget (DEFAULT_AGENT_TOOL_REATTACH_TIMEOUT_MS, 120s, internal): a child that keeps advancing toward terminal within the window is collected; a genuinely hung child still seals interrupted so recovery can never block forever.
- Durable child-run reconcile. A child facet self-heals its interrupted turn via its own chatRecovery, but that recovery path never wrote the child's agent-tool run row, so after a real eviction the row stranded running (think) / was force-errored (ai-chat) and the parent could never collect the recovered result. Both @cloudflare/think and @cloudflare/ai-chat now reconcile a stale child-run row from the durable transcript on inspect: while recovery is still resolving the row stays running; once it settles, a completed assistant response surfaces as completed (so the parent collects the real result) and an empty/failed recovery as error. This keeps the child's own (working) recovery path untouched.
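The first two changes hinge on the run id being a pure function of the tool call id, so a re-run resolves to the same child. A small sketch of that idea follows; the `agent-tool:<toolCallId>` format is taken from the entry, while `childRunId`, `startOrReattach`, the status names, and the in-memory `runs` map are hypothetical stand-ins for the framework's durable run rows.

```typescript
type RunStatus = "running" | "completed" | "error" | "interrupted";

// Stand-in for the durable agent-tool run table.
const runs = new Map<string, RunStatus>();

// Deterministic id: the same tool call always maps to the same child run.
function childRunId(toolCallId: string): string {
  return `agent-tool:${toolCallId}`;
}

// On a (re-)run of the parent turn, either start the child or re-attach to
// the one that already exists for this tool call.
function startOrReattach(
  toolCallId: string
): "started" | "reattached" | "already-terminal" {
  const runId = childRunId(toolCallId);
  const existing = runs.get(runId);
  if (existing === undefined) {
    runs.set(runId, "running");
    return "started";
  }
  // A live child is tailed to its result; a finished one is just collected.
  return existing === "running" ? "reattached" : "already-terminal";
}
```

With a random id per call, the second invocation after recovery would always take the "started" branch and redo the child's work, which is the failure mode this entry describes.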
No new public configuration. Adds an internal agent_tool:recovery:reattach observability event. @cloudflare/think and @cloudflare/ai-chat child tails are now read-only on consumer detach (a parent's re-attach budget expiring never cancels the still-running child).