Minor Changes
-
#1636
f5a0d00 Thanks @threepointone! - Expose recovery incident identity and enrich the onExhausted payload so
products can build a terminal-state policy without re-deriving anything (#1631).

ChatRecoveryContext (the onChatRecovery argument) now includes
recoveryRootRequestId, the stable request ID for the whole continuation
chain. Unlike requestId, it doesn't change across chained continuations, so
it's the right key for per-incident budget tracking and fresh-incident detection
without re-deriving identity from message IDs.

ChatRecoveryExhaustedContext (the onExhausted argument) now carries
recoveryRootRequestId, terminalMessage (the exact text shown to the user),
partialText/partialParts (what the turn produced before it was given up
on), and streamId/createdAt. That is enough to render or persist a user-facing
terminal banner and emit correlated terminal telemetry (e.g. time-since-turn-start,
stream correlation) directly, without re-deriving anything.
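
As a rough illustration of how an app might consume these fields, the sketch below turns an exhausted-recovery payload into a banner plus a telemetry record. The interface is a local stand-in declared for this example: the field names come from the notes above, but their types (for instance createdAt as epoch milliseconds and partialParts as an array) and the helper buildTerminalRecord are assumptions, not the library's definitions.

```typescript
// Local stand-in for the onExhausted argument. Field names follow the notes
// above; the types are assumptions made for this sketch.
interface ExhaustedContextSketch {
  recoveryRootRequestId: string;
  terminalMessage: string;
  partialText: string;
  partialParts: unknown[];
  streamId: string;
  createdAt: number; // assumed: epoch milliseconds of turn start
}

interface TerminalRecord {
  incidentKey: string;
  banner: string;
  streamId: string;
  msSinceTurnStart: number;
  hadPartialOutput: boolean;
}

// Hypothetical helper: derive what a terminal banner and a telemetry event
// need straight from the payload, without inspecting message IDs.
function buildTerminalRecord(
  ctx: ExhaustedContextSketch,
  now: number
): TerminalRecord {
  return {
    incidentKey: ctx.recoveryRootRequestId, // stable across chained continuations
    banner: ctx.terminalMessage,
    streamId: ctx.streamId,
    msSinceTurnStart: Math.max(0, now - ctx.createdAt),
    hadPartialOutput: ctx.partialText.length > 0 || ctx.partialParts.length > 0,
  };
}
```

Keying the record on recoveryRootRequestId rather than requestId means every continuation in one incident lands under the same key.
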
All fields are additive. Applied across agents (shared types),
@cloudflare/think, and @cloudflare/ai-chat.
-
#1648
d6827ab Thanks @threepointone! - Surface a live "recovering…" status to chat clients during durable recovery (#1620)

When a durable chat turn is interrupted (a deploy/eviction, or a stream-stall
watchdog abort) and resumes, clients had no "in progress" signal: the turn
looked frozen until it completed or a terminal error was replayed. A new
cf_agent_chat_recovering protocol frame is now broadcast when recovery is scheduled
and cleared on every terminal outcome (completed/skipped/failed/exhausted), so
the indicator can't spin forever. In @cloudflare/think it's also persisted and
replayed on connect, so a client that joins mid-recovery learns the turn is
working.

useAgentChat exposes a new isRecovering flag (distinct from
isStreaming, because a recovering turn isn't producing tokens yet); most UIs render
isStreaming || isRecovering as "busy". Backward-compatible: clients that don't
understand the frame ignore it.

Note: @cloudflare/ai-chat broadcasts the live signal but does not yet replay
it on connect (it has no idle-connect hydration path; tracked in #1645).
@cloudflare/think has both.

For recovery telemetry, subscribe to the chat:recovery:* observability events
and route them to your analytics sink.
-
#1611
02f9380 Thanks @threepointone! - Add bounded, observable recovery foundations for durable chat turns and fibers.

- Add dedicated recovery observability channels/events for fibers, chat recovery, transcript repair, and agent-tool recovery.
- Bound internal framework fiber recovery hooks and parent agent-tool recovery scans so startup and recovery work cannot wedge indefinitely.
- Add shared chat recovery incident tracking with attempt counts, configurable chatRecovery defaults, and terminal exhaustion behavior for AIChatAgent and Think. Think recovery now exhausts after six failed attempts by default and sends a terminal error frame instead of spinning indefinitely.
- Keep the recovery attempt budget bounded even when an interrupted turn flips between retry and continue recovery kinds (the incident identity no longer includes the kind), guard a throwing onExhausted hook so the terminal UX is still delivered, mark incidents failed when the recovery dispatch throws, and reclaim incident records on success, with a TTL sweep for abandoned ones, so durable storage does not grow without bound.
- Bound generic unmanaged fiber recovery with a configurable fiberRecoveryMaxAgeMs so a repeatedly-throwing onFiberRecovered() hook cannot re-trigger forever across restarts.
- Surface Think post-persist chat request failures through onChatError(error, ctx) and chat:request:failed.
- Repair incomplete Think tool-call transcripts before provider calls and allow createCompactFunction() to use a supplied token counter for tail budgeting.
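
Two of the behaviors above (a budget that does not reset when the recovery kind flips, and a terminal frame that survives a throwing hook) can be sketched as follows. This is a minimal in-memory illustration, not the framework's code: attemptsByIncident, recordAttempt, and deliverTerminal are hypothetical names, and the real incident records live in durable storage.

```typescript
type RecoveryKind = "retry" | "continue";

// Hypothetical in-memory stand-in for the durable incident store.
const attemptsByIncident = new Map<string, number>();

// The lookup key deliberately ignores `kind`, so a turn that flips between
// retry and continue keeps drawing from one shared attempt budget.
function recordAttempt(
  incidentId: string,
  _kind: RecoveryKind,
  maxAttempts = 6 // the Think default named in this entry
): "attempt" | "exhausted" {
  const used = attemptsByIncident.get(incidentId) ?? 0;
  if (used >= maxAttempts) return "exhausted";
  attemptsByIncident.set(incidentId, used + 1);
  return "attempt";
}

// Run the app's hook, but never let a throw block the terminal frame.
function deliverTerminal(
  onExhausted: () => void,
  sendTerminalFrame: () => void
): void {
  try {
    onExhausted();
  } catch {
    // Swallowed in this sketch; a real implementation would likely log it.
  }
  sendTerminalFrame();
}
```
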
-
#1638
b6c8dea Thanks @threepointone! - Key chat recovery's budget to a no-progress wall clock instead of a raw attempt
count, so a healthy turn under deploy churn isn't sealed prematurely (#1637).

Under continuous deploys the attempt count is the wrong primary bound: one
rollout drops/reconnects the socket several times (~11–22s), each firing a
recovery alarm, so the count inflated far faster than the real interruption rate
and exhausted turns that were still advancing (0/23 model calls errored in the
reported incident; it was pure eviction churn).

Now:

- Primary bound: a 5-minute no-progress wall clock keyed to lastProgressAt,
  which resets on every progress-bearing attempt. A turn that keeps producing
  content survives churn indefinitely; one that genuinely goes quiet is sealed
  within 5 minutes.
- Alarm debounce (~30s): recovery alarms bunched within the window (a single
  rollout's reconnect storm) collapse into one attempt.
- Attempt cap is now a high secondary backstop (default raised 6 → 10) that
  resets on progress; it only catches a pathological tight alarm-loop.
- The existing 15-minute absolute incident-age ceiling is kept as the final
  non-resetting hard stop.
- Progress signal moved to production time (when new content is durably
  flushed/streamed) instead of persist time, so it advances only on genuinely
  new content and is immune to client reconnects and recovery re-persists, which
  the no-progress window depends on. (Builds on the compaction-immune counter
  from #1628.)
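
The layered bounds above can be read as one decision function evaluated on each recovery alarm. The sketch below is one plausible arrangement under stated assumptions: the constants mirror the numbers in this entry, but IncidentSketch, the order of the checks, and the function names are invented for illustration and are not taken from the implementation.

```typescript
const MINUTE_MS = 60 * 1000;
const NO_PROGRESS_WINDOW_MS = 5 * MINUTE_MS; // primary bound
const ALARM_DEBOUNCE_MS = 30 * 1000; // collapses a reconnect storm
const MAX_ATTEMPTS = 10; // secondary backstop, resets on progress
const MAX_INCIDENT_AGE_MS = 15 * MINUTE_MS; // non-resetting hard stop

// Hypothetical incident record; the real shape is not shown in these notes.
interface IncidentSketch {
  startedAt: number;
  lastProgressAt: number;
  lastAttemptAt: number | null;
  attemptsSinceProgress: number;
}

type AlarmDecision = "attempt" | "debounced" | "exhausted";

function newIncident(now: number): IncidentSketch {
  return { startedAt: now, lastProgressAt: now, lastAttemptAt: null, attemptsSinceProgress: 0 };
}

// Called when genuinely new content is produced: resets the no-progress
// clock and the attempt counter, but never the incident start time.
function recordProgress(incident: IncidentSketch, now: number): void {
  incident.lastProgressAt = now;
  incident.attemptsSinceProgress = 0;
}

function onRecoveryAlarm(incident: IncidentSketch, now: number): AlarmDecision {
  if (now - incident.startedAt >= MAX_INCIDENT_AGE_MS) return "exhausted";
  if (now - incident.lastProgressAt >= NO_PROGRESS_WINDOW_MS) return "exhausted";
  if (incident.lastAttemptAt !== null && now - incident.lastAttemptAt < ALARM_DEBOUNCE_MS) {
    return "debounced"; // bunched alarm: do not count a new attempt
  }
  if (incident.attemptsSinceProgress >= MAX_ATTEMPTS) return "exhausted";
  incident.attemptsSinceProgress += 1;
  incident.lastAttemptAt = now;
  return "attempt";
}
```

With this arrangement a turn that records progress every few minutes keeps passing the first three checks until the 15-minute ceiling, while a silent turn is sealed by the 5-minute window regardless of how many alarms fire.
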

Applies to both @cloudflare/think and @cloudflare/ai-chat, including the
TaskSubAgent/sub-agent recovery path.
Patch Changes
-
#1634
a4225fd Thanks @threepointone! - Stop chat recovery from discarding settled work when a turn is given up on
(#1631).

Two paths could throw away a partial assistant message containing completed,
often non-idempotent tool results:

- When the framework's own recovery budget was exhausted, _exhaustChatRecovery
  sealed the turn (terminal status + banner) before the orphaned stream was
  ever persisted, so every settled tool result the turn had produced was lost
  and the model re-ran them on the next message. Exhaustion now persists the
  settled partial first, using the same gating as the normal recovery path so it
  can't duplicate an already-saved partial.
- A subclass onChatRecovery returning { persist: false } to stop a turn used
  to silently drop the settled partial. Settled work is now never dropped:
  persist: false only suppresses persistence of a partial that has nothing
  settled to lose; a partial carrying settled tool results is persisted
  regardless. An app can no longer accidentally discard completed work, and it
  never needs { persist: true } just to stay safe. (A safe default beats a
  warning about an unsafe one.)
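
The persist rule in the second bullet reduces to a small predicate. The sketch below assumes a simplified part model declared locally: treating the output-available and output-error states as "settled", and persisting by default when persist is omitted, are both assumptions for illustration, as is the name shouldPersistPartial.

```typescript
// Simplified local part model; the real message parts carry more fields.
type ToolState =
  | "input-available"
  | "approval-requested"
  | "output-available"
  | "output-error";

interface ToolPart { kind: "tool"; state: ToolState }
interface TextPart { kind: "text"; text: string }
type Part = ToolPart | TextPart;

// Assumption: a tool part is "settled" once it has an output or an error.
function hasSettledToolResult(parts: Part[]): boolean {
  return parts.some(
    (p) => p.kind === "tool" && (p.state === "output-available" || p.state === "output-error")
  );
}

// persist: false can only suppress a partial with nothing settled to lose.
function shouldPersistPartial(parts: Part[], persist?: boolean): boolean {
  if (parts.length === 0) return false; // nothing to save
  if (hasSettledToolResult(parts)) return true; // settled work is never dropped
  return persist !== false; // assumed default: persist unless told not to
}
```
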

Applied identically to @cloudflare/think and @cloudflare/ai-chat.
-
#1633
1aca578 Thanks @threepointone! - Fix chat recovery prematurely exhausting its retry budget under compaction
(#1628). The deploy-churn forward-progress signal, which resets the recovery
budget when an interrupted turn is actually advancing, was recomputed from the
live transcript by counting assistant messages. Compaction collapses older
assistant messages into a summary, lowering that count, so a turn that had
genuinely advanced could read as "no progress" between recovery attempts and
exhaust at maxAttempts, sealing a healthy turn. Progress is now tracked by a
durable, monotonic counter incremented when _persistOrphanedStream materializes
a non-empty partial (the exact event the message count was proxying for), so
compaction can never lower it. A turn that genuinely fails to advance still
exhausts at the cap, and the 15-minute wall-clock ceiling is unchanged.
-
#1615
51a771f Thanks @threepointone! - Chat recovery no longer permanently abandons a turn under repeated deploys. A
mid-turn deploy resets the Durable Object ("code was updated") and the
interrupted continuation is re-detected on the next wake; previously every such
interruption consumed one of the bounded recovery attempts, so a deploy every
few minutes exhausted the budget (max_attempts_exceeded) and the turn was
terminally abandoned even though each fresh isolate was healthy. Recovery now
distinguishes an interruption that followed forward progress (more persisted
assistant content than the previous attempt observed), which is treated as
environmental and not counted against the budget, from a turn that never
advances, which still exhausts at maxAttempts. A 15-minute wall-clock ceiling
per incident bounds the worst case so a continuously churning environment
cannot retry forever.
-
#1608
7c17736 Thanks @cjol! - Fix auto-continuation stream resumes so immediate client-tool resume requests attach to the pending continuation instead of receiving cf_agent_stream_resume_none.
-
#1651
d118d11 Thanks @threepointone! - Fix auto-continuation firing before all parallel client-tool results arrive
(#1649). When the model emitted multiple tool calls in one step and the client
resolved them independently via addToolOutput, a fast result's autoContinue
could trigger the next inference while a slower sibling was still
input-available. That fed the provider an incomplete tool-result set
(MissingToolResultsError) or, via the transcript-repair backstop, silently
flipped the in-flight sibling to errored and ran a spurious extra continuation.

Auto-continuation now waits until the transcript is stable (no
input-available/approval-requested parts) before continuing, so a fanned-out
tool batch coalesces into a single continuation regardless of result arrival
order. The wait is bounded, so a genuinely orphaned tool call (e.g. the client
disconnected mid-batch) still falls through to the existing backstop instead of
pinning the continuation open.
-
#1641
3aa1936 Thanks @threepointone! - Count a sub-agent's progress as the orchestrating parent's recovery progress

A parent turn whose work is "run a sub-agent and await its result" produced no
recoverable content of its own, so under deploy churn the parent's own
chat-recovery no-progress window could exhaust while the child was still
healthily streaming, abandoning the turn as interrupted and collecting an
interrupted result even though the child went on to complete. (Reproduced by
the examples/deploy-churn --mode subagent harness: the parent exhausted at
attempt 6/6 with progress: 1 while the child self-healed all 30 steps.)

Forwarding a child's stream to the parent's connections is now treated as
genuine forward progress for the parent's recovery budget: Think and
AIChatAgent advance their durable recovery-progress marker (throttled) each
time _forwardAgentToolStream forwards child output, so a parent that keeps
re-attaching to and streaming a live child survives churn indefinitely. The
credit is only granted when the child actually produces output. A silent or
hung child still lets the parent exhaust on its own no-progress timer, so a
stuck sub-agent can never pin a parent's recovery open forever.

This completes the sub-agent recovery story started by the stable-runId +
bounded re-attach fix (#1630): the child self-heals and the parent both
re-attaches to it and keeps its own recovery alive while doing so.
-
#1646
a245a4a Thanks @threepointone! - Terminalize a chat-recovery turn through onExhausted when it gives up waiting for stable state

Under extreme churn (a long turn interrupted many times in quick succession), a
recovery callback (_chatRecoveryRetry/_chatRecoveryContinue) could keep
timing out waiting for the isolate to reach stable state until its retry budget
drained. The give-up path only marked the incident failed and completed the
recovered submission as error. It bypassed _exhaustChatRecovery, so
onExhausted never fired, the chat:recovery:exhausted event was not emitted,
the configured terminalMessage banner was never delivered, and the terminal
chat status was not recorded. Apps relying on onExhausted for the terminal
banner saw an eternal spinner with no terminal signal.

The stable-state-timeout give-up now routes through the same
_exhaustChatRecovery path as deploy-recovery and stall exhaustion: it fires
onExhausted (with reason: "stable_timeout"), emits chat:recovery:exhausted,
marks the durable submission interrupted, records the terminal chat status, and
delivers the terminalMessage. As an extra backstop against silent drops, the
give-up also terminalizes when the incident record is missing (no incidentId,
or it was swept/deleted before a stale alarm fired) by synthesizing a terminal
incident from the recovery-root request id, so a turn can never be dropped with
no terminal UX.
-
#1623
4c8b371 Thanks @threepointone! - Fix chat recovery falsely marking a durable submission as error under repeated mid-turn deploys.

When several deploys interrupt a single turn, recovery runs a chain of continuations. Three bugs combined to leave the submission in error even when the turn actually completed every step:

- Lost ownership. The submission link (recoveredRequestId) was derived from each continuation's own (fresh) requestId, so chained continuations dropped it, and the continuation that finally completed the turn could no longer mark the submission completed.
- Stale-continuation clobber. A superseded continuation tripped the conversation_changed guard because the leaf had advanced via recovery's own forward progress (a new assistant message), not a new user turn, and overwrote the still-running submission to error.
- Premature stable_timeout. A timeout while waiting for the isolate to settle (common while a deploy is in flight) failed the turn terminally at the first attempt.

Now: submission ownership is keyed off the stable recovery root and threaded through the entire continuation chain (including the terminal abandon paths, recovery exhaustion and { continue: false }, which previously marked the submission by the per-continuation requestId and so left a chained submission stuck running); a superseded continuation skips benignly (only a genuinely newer user turn marks the submission skipped, never error); and a stable-state timeout reschedules within the maxAttempts budget. A turn that completes under deploy churn now ends completed, not error.

@cloudflare/ai-chat has the same recovery machinery but no durable-submission layer, so it receives the stable_timeout reschedule fix only: a transient stable-state timeout now retries within the attempt budget instead of permanently abandoning a recoverable turn at the first attempt.
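
The ownership fix described in this entry amounts to carrying one stable key down the continuation chain. The sketch below illustrates that idea only: ContinuationSketch, the in-memory submissions map, and the helper names are hypothetical, and the real submission layer is durable rather than a Map.

```typescript
interface ContinuationSketch {
  requestId: string; // fresh for every continuation
  recoveryRootRequestId: string; // stable for the whole chain
}

type SubmissionStatus = "running" | "completed" | "skipped" | "error";

// Hypothetical stand-in for the durable submission table, keyed by root.
const submissions = new Map<string, SubmissionStatus>();

function startTurn(requestId: string): ContinuationSketch {
  // The first request in a chain is its own root.
  return { requestId, recoveryRootRequestId: requestId };
}

function chainContinuation(
  prev: ContinuationSketch,
  freshRequestId: string
): ContinuationSketch {
  // Inherit the root instead of deriving ownership from the fresh requestId.
  return { requestId: freshRequestId, recoveryRootRequestId: prev.recoveryRootRequestId };
}

function completeSubmission(c: ContinuationSketch): boolean {
  if (!submissions.has(c.recoveryRootRequestId)) return false;
  submissions.set(c.recoveryRootRequestId, "completed");
  return true;
}
```

Looking the submission up by requestId instead would miss for every continuation after the first, which is the "lost ownership" failure in the first bullet.
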
-
#1606
7419fbc Thanks @threepointone! - Serialize client-tool continuation resumes so they do not overlap the active AI SDK chat request.
-
#1640
edb126a Thanks @threepointone! - Re-attach to a still-running sub-agent (agentTool()) run on parent recovery instead of abandoning and re-running it (#1630).

When a parent agent was interrupted (deploy / Durable Object eviction) while a child agentTool() run was still in flight, recovery marked the run interrupted within a ~5s window and the parent re-issued the task, re-running the child's already-completed work. For long-running children under continuous deploys this surfaced to users as "the agent went all the way back and lost the files it already wrote."

Three changes fix this:

- Stable child runId. agentTool() now derives the child runId from the (recovery-preserved) tool call id (agent-tool:<toolCallId>) instead of minting a fresh nanoid per call. A turn re-run by chat recovery now resolves to the same idempotent child facet rather than spawning a brand-new one, so completed child work is never re-run.
- Bounded re-attach. A duplicate non-terminal runId (in runAgentTool) and a still-running child during startup reconciliation now tail the live child to its real terminal result and collect it, instead of immediately sealing interrupted. Re-attach is bounded by a generous wall-clock budget (DEFAULT_AGENT_TOOL_REATTACH_TIMEOUT_MS, 120s, internal): a child that keeps advancing toward terminal within the window is collected; a genuinely hung child still seals interrupted so recovery can never block forever.
- Durable child-run reconcile. A child facet self-heals its interrupted turn via its own chatRecovery, but that recovery path never wrote the child's agent-tool run row, so after a real eviction the row was stranded running (think) or force-errored (ai-chat) and the parent could never collect the recovered result. Both @cloudflare/think and @cloudflare/ai-chat now reconcile a stale child-run row from the durable transcript on inspect: while recovery is still resolving the row stays running; once it settles, a completed assistant response surfaces as completed (so the parent collects the real result) and an empty/failed recovery as error. This keeps the child's own (working) recovery path untouched.

No new public configuration. Adds an internal agent_tool:recovery:reattach observability event. @cloudflare/think and @cloudflare/ai-chat child tails are now read-only on consumer detach (a parent's re-attach budget expiring never cancels the still-running child).
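
The first two changes in this entry can be sketched in a few lines. The runId format and the 120s budget come from the notes above; childRunId, reattachOutcome, and the status names are hypothetical, and the pure decision function stands in for what the notes describe as tailing a live child.

```typescript
// Deterministic: the same tool call always maps to the same child run,
// so a recovered turn resolves to the existing child instead of a new one.
function childRunId(toolCallId: string): string {
  return `agent-tool:${toolCallId}`;
}

type ChildStatus = "running" | "completed" | "error";
type ReattachOutcome = "collect" | "keep-tailing" | "seal-interrupted";

// Value quoted in this entry for the internal re-attach budget.
const REATTACH_TIMEOUT_MS = 120 * 1000;

// Simplified decision: collect a terminal child, keep tailing a running one
// while inside the budget, and seal as interrupted once the budget is spent.
function reattachOutcome(
  status: ChildStatus,
  waitedMs: number,
  timeoutMs: number = REATTACH_TIMEOUT_MS
): ReattachOutcome {
  if (status !== "running") return "collect";
  return waitedMs < timeoutMs ? "keep-tailing" : "seal-interrupted";
}
```

Sealing the parent's view as interrupted does not imply cancelling the child; per the closing note of this entry, child tails are read-only on consumer detach.
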