Patch Changes
-
#1742
4b201a9 Thanks @threepointone! - Fix duplicated assistant text parts when a stream resume is replayed twice (#1733).

The server intentionally sends `CF_AGENT_STREAM_RESUMING` for the same request from both `onConnect` and its `CF_AGENT_STREAM_RESUME_REQUEST` handler. When both offers reached the `useAgentChat` fallback path (e.g. the transport's resume handshake had already timed out), the client ACKed both, the full chunk buffer was replayed twice into the same accumulator, and the streaming reply rendered as two stacked text blocks until refresh.

- `useAgentChat` now fallback-ACKs a given resume offer at most once per socket (reset on close/reconnect). A repeated offer is still handed to a waiting transport resume handshake first, so a fallback-observed stream can become transport-owned. It also resets the matching trailing assistant message on every replayed non-continuation `start`, not only while the resume request id is still pending.
- The shared broadcast stream state machine re-initializes its accumulator on a replayed `start`, making replay idempotent under any number of replays.
- Replay frames now carry `continuation: true` for continuation streams (persisted in stream metadata and restored after hibernation), so a replayed continuation appends to the existing assistant message instead of being mistaken for a fresh turn.
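
The two client-side guards described in this entry can be illustrated with a small sketch. It is not the `useAgentChat` implementation: the frame shape and the names `SketchFrame`, `SketchAccumulator`, `SketchAckGate`, and `replayTwice` are hypothetical, chosen only to show why re-initializing on `start` makes a double replay harmless and how an at-most-once ACK per socket might be tracked.

```typescript
// Hypothetical frame shape for illustration; the real wire format is not shown here.
type SketchFrame =
  | { type: "start"; continuation?: boolean }
  | { type: "text-delta"; delta: string };

// Accumulates text for one assistant message and tolerates replayed frames.
class SketchAccumulator {
  private readonly base: string;
  private text: string;

  // `base` is the text already in the assistant message before this stream began.
  constructor(base: string = "") {
    this.base = base;
    this.text = base;
  }

  apply(frame: SketchFrame): void {
    if (frame.type === "start") {
      // Re-initialize on every start so a second replay cannot stack on the first.
      // A continuation restarts from the existing message text, not from empty.
      this.text = frame.continuation === true ? this.base : "";
    } else {
      this.text += frame.delta;
    }
  }

  get value(): string {
    return this.text;
  }
}

// Allows one fallback ACK per request id until the socket closes.
class SketchAckGate {
  private readonly acked = new Set<string>();

  shouldAck(requestId: string): boolean {
    if (this.acked.has(requestId)) return false;
    this.acked.add(requestId);
    return true;
  }

  // Call on close/reconnect so a new socket may ACK the offer again.
  reset(): void {
    this.acked.clear();
  }
}

// Feeds the same frame sequence through one accumulator twice.
function replayTwice(frames: SketchFrame[], base: string = ""): string {
  const acc = new SketchAccumulator(base);
  for (const frame of [...frames, ...frames]) acc.apply(frame);
  return acc.value;
}
```

In this sketch, replaying a buffer twice yields the same text as replaying it once, and a continuation replay keeps the text the message already had.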
-
#1740
6c9de59 Thanks @threepointone! - Defer one-shot scheduled callbacks (and chat-recovery give-ups) on platform transients instead of consuming them mid-deploy (#1730).

A mid-execution Durable Object code-update reset surfaces storage failures in two shapes: the verbatim reset/supersede messages (already deferred) and `SqlError: SQL query failed: Network connection lost.`, a wrapper that drops the CF `retryable` flag and dodges the reset matcher. The second shape burned the in-process retry budget inside the same few-seconds reset window (which outlasts the retry schedule by design) and then consumed the one-shot row on exhaustion, freezing the turn for minutes until incident re-detection. In the reported production capture, storage was healthy again 15 ms after the final attempt.

- `agents`: new cause-aware `isPlatformTransientError` classifier (exported, alongside `isDurableObjectCodeUpdateReset`). It matches reset/supersede messages, `retryable`-flagged platform errors (excluding overloaded), and "Network connection lost.", looked up through wrapper `cause` chains. `_executeScheduleCallback` keeps in-process retries for connection-lost transients (a genuine blip heals fast) but on exhaustion of a one-shot row it now re-throws instead of swallowing, so the row survives and the alarm re-runs it in the healthy window that follows. Genuine application errors are still abandoned after `maxAttempts` exactly as before.
- `@cloudflare/think`: `_handleRecoveryCallbackError` now defers (re-throws) on any platform transient instead of terminalizing through a give-up whose own seal needs the storage that is down; the bookkeeping write on the defer path is best-effort. The defer path no longer marks the recovered submission `error` (which made the deferred re-run skip with `submission_not_running`, a self-defeating defer); it stays `running` for the re-run to pick up. The give-up now seals the incident `exhausted` only after the terminal writes succeed, so a transient mid-seal defers the whole give-up for an idempotent re-run instead of half-sealing.
- `@cloudflare/ai-chat`: same give-up seal ordering. The incident is sealed only after `_exhaustChatRecovery` (incl. the durable terminal record) succeeds, so a transient mid-seal preserves the one-shot row and the give-up re-runs in full on a healthy isolate.
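
To make "cause-aware" concrete, here is a minimal sketch of a classifier that walks a `cause` chain. It is not the exported `isPlatformTransientError`: the name `looksLikePlatformTransient`, the `SketchErrorLike` shape, and the `extraFragments` parameter are hypothetical, and because this entry does not quote the reset/supersede message text, the caller must supply it.

```typescript
// Loose view of an error object; every field is optional and untrusted.
interface SketchErrorLike {
  message?: unknown;
  retryable?: unknown;
  overloaded?: unknown;
  cause?: unknown;
}

// Returns true if the error, or anything in its `cause` chain, looks like a
// platform transient. `extraFragments` stands in for the reset/supersede
// message text, which this sketch does not hard-code.
function looksLikePlatformTransient(
  err: unknown,
  extraFragments: string[] = [],
): boolean {
  const seen = new Set<unknown>(); // guards against cyclic cause chains
  let current: unknown = err;

  while (typeof current === "object" && current !== null && !seen.has(current)) {
    seen.add(current);
    const e = current as SketchErrorLike;
    const message = typeof e.message === "string" ? e.message : "";

    // Platform errors flagged retryable count, unless they signal overload.
    if (e.retryable === true && e.overloaded !== true) return true;

    // The wrapper shape that drops the retryable flag.
    if (message.includes("Network connection lost.")) return true;

    // Caller-supplied message fragments (assumed, not taken from the library).
    if (extraFragments.some((fragment) => message.includes(fragment))) return true;

    current = e.cause; // descend into the wrapped error, if any
  }
  return false;
}
```

A caller following the behaviour described above would re-throw when such a check passes on an exhausted one-shot row, and abandon the row otherwise.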
-
#1737
bc43133 Thanks @cjol! - Fix the two remaining #1575 gaps in how in-band stream errors (`{type: "error", errorText}` chunks inside an otherwise-healthy provider stream) are observed after the fact.

Errored-stream replay (partial content was lost on reconnect). A client reconnecting after an in-band error received the terminal error frame (#1645) but not the content the model streamed before the error: the replay path only served `status = 'completed'` streams, so an errored stream's buffered chunks were unreachable, and the server pushes no messages on connect. `ResumableStream` gains `replayErroredChunksByRequestId`, and the resume-ACK terminal replay (`_replayTerminalOnAck` in both `AIChatAgent` and `Think`) now replays the errored stream's stored chunks before the `done: true, error: true` frame, so a reconnecting client observes the same sequence a live client did. No wire-format or schema changes: replayed chunks reuse the existing `replay: true` frame shape and the error text still comes from the durable terminal record.

Agent-tool error attribution (cross-run contamination). When an in-band error frame was broadcast on a child agent and the active run was unknown, the error was stamped onto every tailed run, so an unrelated turn's failure (or one of several overlapping runs) could mark healthy runs as `error`, and capture depended on a tailer being attached at the right moment. Frames are now attributed by the request id they carry: each agent-tool run is bound to its turn's request id when the turn starts (persisted on the run row at start rather than at terminal, so attribution survives a DO restart mid-run), and only the owning run's error/progress state is updated. Frame inspection also no longer requires an attached tailer, so error capture is independent of tailer timing.
-
#1712
835e7b0 Thanks @threepointone! - Reclaim resumable-stream buffers from an alarm so idle chats don't leak storage (#1706)

Resumable-stream chunk buffers (`cf_ai_chat_stream_*`) were only swept lazily when a subsequent stream completed. A chat that received a single turn and then went idle never triggered that sweep, so its buffers lingered in the Durable Object's SQLite for the lifetime of the DO.

`AIChatAgent` and `Think` now arm a scheduled cleanup alarm whenever a stream starts and whenever it finishes (completes or errors). Arming on start guarantees that a stream whose DO is evicted mid-flight and never reaches a finish still gets a future sweep instead of leaking. This is the safety net for the non-durable path (e.g. `chatRecovery: false`, the `AIChatAgent` default): those turns don't run inside `runFiber`, so there's no leftover `keepAlive` alarm and no fiber-recovery scan, and if the client never reconnects nothing else wakes the DO. (Durable `runFiber` turns already self-heal: the `keepAlive` alarm survives eviction, wakes the DO, and recovery finalizes the stream, which arms cleanup. Arming on start is therefore belt-and-suspenders there.) The alarm sweeps aged buffers via the retention windows below and re-arms only while reclaimable rows remain, so a fully-swept DO stops waking itself. Arming is idempotent so high-turn-count chats never accumulate cleanup schedules; the in-callback re-arm uses a fresh (non-idempotent) row so it survives the one-shot deletion of the firing schedule. No per-turn Durable Object and no change to the session DO lifecycle are required.

Retention is now split into two short, purpose-specific windows instead of a single 24h threshold: completed/errored buffers are kept for a brief 10-minute reconnect-and-replay grace (the assistant message is persisted separately, so the buffer is only needed to replay a just-finished stream or deliver a terminal error frame to a reconnecting client), while abandoned in-flight (`streaming`) rows are kept for 1 hour so an interrupted turn has ample time to be resumed or recovered before its buffer is presumed dead. The abandoned-row sweep keys off last chunk activity rather than stream start time, so a long-running stream that is still emitting chunks is never reclaimed mid-flight.

`ResumableStream` gains `cleanup(now?)` (force a sweep, bypassing the lazy interval gate) and `hasReclaimableStreams()` to support alarm-driven cleanup.
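
The two retention windows can be sketched as a pure predicate over buffer rows. This is an illustration under stated assumptions, not the `ResumableStream` code: the row shape (`SketchStreamRow`, with `finishedAt` and `lastChunkAt` in epoch milliseconds), the constant names, and the strict `>` comparison at the boundary are all hypothetical. The re-arm rule here treats "reclaimable rows remain" as "any buffer rows are left after the sweep", which is one possible reading.

```typescript
type SketchStreamStatus = "streaming" | "completed" | "error";

// Hypothetical row shape; the real table columns are not shown in this entry.
interface SketchStreamRow {
  status: SketchStreamStatus;
  lastChunkAt: number; // epoch ms of the most recent chunk
  finishedAt: number | null; // epoch ms of completion/error, null while streaming
}

const FINISHED_GRACE_MS = 10 * 60 * 1000; // 10-minute reconnect-and-replay grace
const ABANDONED_AFTER_MS = 60 * 60 * 1000; // 1 hour without chunk activity

// True when a row's buffer has outlived its retention window.
function isExpired(row: SketchStreamRow, now: number): boolean {
  if (row.status === "streaming") {
    // Keyed off last chunk activity, not stream start, so a long-running
    // stream that is still emitting chunks is never reclaimed mid-flight.
    return now - row.lastChunkAt > ABANDONED_AFTER_MS;
  }
  // Completed and errored rows share the short grace window.
  const finishedAt = row.finishedAt ?? row.lastChunkAt;
  return now - finishedAt > FINISHED_GRACE_MS;
}

// Drops expired rows and reports whether another cleanup alarm is worth arming.
function sweepSketch(
  rows: SketchStreamRow[],
  now: number,
): { kept: SketchStreamRow[]; rearm: boolean } {
  const kept = rows.filter((row) => !isExpired(row, now));
  return { kept, rearm: kept.length > 0 };
}
```

Once every row has been swept, `rearm` is false, which corresponds to a fully-swept DO that stops waking itself.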