[v0.50.284] — 2026-05-03
Fixed (2 PRs — P0 streaming hotfix batch — closes #1533, #1558)
-
P0 data-loss hotfix: metadata-only Session.save() wipes conversation history (#1559, maintainer self-built; closes #1558) — Severity: P0. v0.50.279's
_clear_stale_stream_state()(#1525) calledsave()on a session that may have been loaded withmetadata_only=True.Session.save()writesself.messagesto disk via atomicos.replace(), andmetadata_onlystubs synthesizemessages=[]. Result: the on-disk session JSON was atomically replaced with an empty messages list. Every active conversation on v0.50.279 — v0.50.282 was at risk of being silently wiped on the next SSE reconnect after a server restart. Reported by a user on v0.50.282 ("getting weird issues with the latest updates… my prompt disappears… 1000+ message session disappeared too"). The "Reconnecting…" banner with a counter the user screenshotted was the observable symptom of the data being wiped — each cycle of the reconnect loop ran the data-loss code path. Three defensive layers + a startup self-heal: (1)Session.save()raisesRuntimeErrorif_loaded_metadata_only=True— loud crash beats silent wipe;Session.load_metadata_only()sets the flag on the returned stub. (2)_clear_stale_stream_state()detects the metadata-only stub and reloads withmetadata_only=Falsebefore mutating; if the reload fails, bails without clearing rather than wipe (correct asymmetry: better stale flag than wiped data). (3) Asymmetric backup —Session.save()writes<sid>.json.bakIFF the previous on-disk message count is greater than the incoming one (zero overhead on grow path; snapshot on any shrink). (4) Startup self-heal in newapi/session_recovery.pymodule — on server start, scans session JSONs whose count is less than their.bakcount and restores from.bak. Idempotent on clean state. The first server start after deploying v0.50.284 will auto-restore any session that was wiped between deploys. 6 new regression tests intests/test_metadata_save_wipe_1558.pycovering all four layers + idempotence. Pre-release independent reviewer (nesquena) APPROVED with one MUST-FIX (issue-number references #1557 → #1558) which was absorbed. Pre-release Opus advisor SHIP AS-IS with two SHOULD-FIX items absorbed in-release: (a) patch the caller's in-memory stub fields after a successful clear so/api/sessiondoesn't briefly return staleactive_stream_id, avoiding one ghost SSE reconnect; (b) atomic.bakwrite viatmp + os.replace()pattern matching the main file write — prevents a torn.bakfrom a crash mid-write. -
Race fix: stale stream cleanup mutates outside the per-session lock (#1557, @dutchaiagency; closes #1533) — Opus advisor follow-up from v0.50.279.
_clear_stale_stream_state()heldSTREAMS_LOCKonly across the registry lookup; the write tosession.active_stream_id = Nonehappened after release. A concurrent_handle_chat_starton the same session could race: the reader thread could clobber a freshly-registered stream'ssession.active_stream_id, orphaning the new stream and forcing one user retry. Fix: wrap the mutate-and-save block in_get_session_agent_lock(session.session_id)and re-readactive_stream_idinside the lock, bailing if it changed. New deterministic two-thread regression testtest_stale_stream_cleanup_does_not_clobber_concurrent_chat_start. Effect was bounded (one user retry per race window, no data corruption), but the lock is the right shape and the contributor included an actual race test instead of asserting source shape.
Affected versions
- v0.50.279 — first vulnerable to the P0 data-loss path
- v0.50.280, v0.50.281, v0.50.282, v0.50.283 — also vulnerable
- v0.50.284 — this release; fixes the data-loss path, ships startup self-heal so users wiped between deploys get auto-recovery on next launch, and closes the related stale-stream race
Maintainer in-stage fixes (test isolation)
tests/test_sprint29.py::test_valid_skill_accepted— now cleans up thetest-security-skillit creates. Previously leaked into the test SKILLS_DIR and shifted whattests/test_sprint3.py::test_skills_*saw.tests/test_sprint3.py::test_skills_content_known— picks the first skill from/api/skillsrather than hardcodingdogfood, withpytest.skipon empty list (signal that a sibling test repointed the SKILLS_DIR).tests/test_sprint3.py::test_skills_search_returns_subset— relax> 5threshold to> 0, same skip-on-empty escape. Functional contract under test: API returns non-empty when there are skills to return.
Tests
4019 → 4026 passing (+7 net: +6 from #1559 P0 hotfix tests, +1 from #1557 race regression). 0 regressions. Full suite in 109s.
Pre-release verification
- Stage merge: clean apart from the expected
api/routes.pyconflict (combined Layer 2 metadata-only reload + #1557 lock; resolved with metadata-only check FIRST so a stub never even acquires the agent lock). - Browser sanity (HTTP API checks against port 8789): 11 endpoints verified.
- Pre-release Opus advisor: SHIP AS-IS — all 5 verification questions cleared (conflict-resolution order, deadlock risk none, Layer 3 backup interaction, startup self-heal vs concurrent saves, test-isolation fix correctness). Two SHOULD-FIX items absorbed in-release.