cloudflare/agents @cloudflare/ai-chat@0.8.5 on GitHub

Patch Changes

#1742 4b201a9 Thanks @threepointone! - Fix duplicated assistant text parts when a stream resume is replayed twice (#1733).
The server intentionally sends CF_AGENT_STREAM_RESUMING for the same request from both onConnect and its CF_AGENT_STREAM_RESUME_REQUEST handler. When both offers reached the useAgentChat fallback path (e.g. the transport's resume handshake had already timed out), the client ACKed both, the full chunk buffer was replayed twice into the same accumulator, and the streaming reply rendered as two stacked text blocks until refresh.
- useAgentChat now fallback-ACKs a given resume offer at most once per socket (reset on close/reconnect). A repeated offer is still handed to a waiting transport resume handshake first, so a fallback-observed stream can become transport-owned. It also resets the matching trailing assistant message on every replayed non-continuation start, not only while the resume request id is still pending.
- The shared broadcast stream state machine re-initializes its accumulator on a replayed start, making replay idempotent under any number of replays.
- Replay frames now carry continuation: true for continuation streams (persisted in stream metadata and restored after hibernation), so a replayed continuation appends to the existing assistant message instead of being mistaken for a fresh turn.
#1740 6c9de59 Thanks @threepointone! - Defer one-shot scheduled callbacks (and chat-recovery give-ups) on platform transients instead of consuming them mid-deploy (#1730).
A mid-execution Durable Object code-update reset surfaces storage failures in two shapes: the verbatim reset/supersede messages (already deferred) and SqlError: SQL query failed: Network connection lost. — a wrapper that drops the CF retryable flag and dodges the reset matcher. The second shape burned the in-process retry budget inside the same few-seconds reset window (which outlasts the retry schedule by design) and then consumed the one-shot row on exhaustion, freezing the turn for minutes until incident re-detection — in the reported production capture, storage was healthy again 15 ms after the final attempt.
- agents — new cause-aware isPlatformTransientError classifier (exported, alongside isDurableObjectCodeUpdateReset): reset/supersede messages, retryable-flagged platform errors (excluding overloaded), and "Network connection lost.", looked up through wrapper cause chains. _executeScheduleCallback keeps in-process retries for connection-lost transients (a genuine blip heals fast) but on exhaustion of a one-shot row it now re-throws instead of swallowing, so the row survives and the alarm re-runs it in the healthy window that follows. Genuine application errors are still abandoned after maxAttempts exactly as before.
- @cloudflare/think — _handleRecoveryCallbackError now defers (re-throws) on any platform transient instead of terminalizing through a give-up whose own seal needs the storage that is down; the bookkeeping write on the defer path is best-effort. The defer path no longer marks the recovered submission error (which made the deferred re-run skip with submission_not_running — a self-defeating defer); it stays running for the re-run to pick up. The give-up now seals the incident exhausted only after the terminal writes succeed, so a transient mid-seal defers the whole give-up for an idempotent re-run instead of half-sealing.
- @cloudflare/ai-chat — same give-up seal ordering: the incident is sealed only after _exhaustChatRecovery (incl. the durable terminal record) succeeds, so a transient mid-seal preserves the one-shot row and the give-up re-runs in full on a healthy isolate.
#1737 bc43133 Thanks @cjol! - Fix the two remaining #1575 gaps in how in-band stream errors ({type: "error", errorText} chunks inside an otherwise-healthy provider stream) are observed after the fact.
Errored-stream replay (partial content was lost on reconnect). A client reconnecting after an in-band error received the terminal error frame (#1645) but not the content the model streamed before the error — the replay path only served status = 'completed' streams, so an errored stream's buffered chunks were unreachable, and the server pushes no messages on connect. ResumableStream gains replayErroredChunksByRequestId, and the resume-ACK terminal replay (_replayTerminalOnAck in both AIChatAgent and Think) now replays the errored stream's stored chunks before the done: true, error: true frame, so a reconnecting client observes the same sequence a live client did. No wire-format or schema changes: replayed chunks reuse the existing replay: true frame shape and the error text still comes from the durable terminal record.
Agent-tool error attribution (cross-run contamination). When an in-band error frame was broadcast on a child agent and the active run was unknown, the error was stamped onto every tailed run — so an unrelated turn's failure (or one of several overlapping runs) could mark healthy runs as error, and capture depended on a tailer being attached at the right moment. Frames are now attributed by the request id they carry: each agent-tool run is bound to its turn's request id when the turn starts (persisted on the run row at start rather than at terminal, so attribution survives a DO restart mid-run), and only the owning run's error/progress state is updated. Frame inspection also no longer requires an attached tailer, so error capture is independent of tailer timing.
#1712 835e7b0 Thanks @threepointone! - Reclaim resumable-stream buffers from an alarm so idle chats don't leak storage (#1706)
Resumable-stream chunk buffers (cf_ai_chat_stream_*) were only swept lazily when a subsequent stream completed. A chat that received a single turn and then went idle never triggered that sweep, so its buffers lingered in the Durable Object's SQLite for the lifetime of the DO.
AIChatAgent and Think now arm a scheduled cleanup alarm whenever a stream starts and whenever it finishes (completes or errors). Arming on start guarantees that a stream whose DO is evicted mid-flight and never reaches a finish still gets a future sweep instead of leaking. This is the safety net for the non-durable path (e.g. chatRecovery: false, the AIChatAgent default): those turns don't run inside runFiber, so there's no leftover keepAlive alarm and no fiber-recovery scan, and if the client never reconnects nothing else wakes the DO. (Durable runFiber turns already self-heal — the keepAlive alarm survives eviction, wakes the DO, and recovery finalizes the stream, which arms cleanup — so arming on start is belt-and-suspenders there.) The alarm sweeps aged buffers via the retention windows below and re-arms only while reclaimable rows remain, so a fully-swept DO stops waking itself. Arming is idempotent so high-turn-count chats never accumulate cleanup schedules; the in-callback re-arm uses a fresh (non-idempotent) row so it survives the one-shot deletion of the firing schedule. No per-turn Durable Object and no change to the session DO lifecycle are required.
Retention is now split into two short, purpose-specific windows instead of a single 24h threshold: completed/errored buffers are kept for a brief 10-minute reconnect-and-replay grace (the assistant message is persisted separately, so the buffer is only needed to replay a just-finished stream or deliver a terminal error frame to a reconnecting client), while abandoned in-flight (streaming) rows are kept for 1 hour so an interrupted turn has ample time to be resumed or recovered before its buffer is presumed dead. The abandoned-row sweep keys off last chunk activity rather than stream start time, so a long-running stream that is still emitting chunks is never reclaimed mid-flight.
ResumableStream gains cleanup(now?) (force a sweep, bypassing the lazy interval gate) and hasReclaimableStreams() to support alarm-driven cleanup.