github nesquena/hermes-webui v0.50.284
v0.50.284 — P0 data-loss hotfix + stale-stream race

latest releases: v0.50.289, v0.50.288, v0.50.287...
4 hours ago

[v0.50.284] — 2026-05-03

Fixed (2 PRs — P0 streaming hotfix batch — closes #1533, #1558)

  • P0 data-loss hotfix: metadata-only Session.save() wipes conversation history (#1559, maintainer self-built; closes #1558) — Severity: P0. v0.50.279's _clear_stale_stream_state() (#1525) called save() on a session that may have been loaded with metadata_only=True. Session.save() writes self.messages to disk via atomic os.replace(), and metadata_only stubs synthesize messages=[]. Result: the on-disk session JSON was atomically replaced with an empty messages list. Every active conversation on v0.50.279 — v0.50.282 was at risk of being silently wiped on the next SSE reconnect after a server restart. Reported by a user on v0.50.282 ("getting weird issues with the latest updates… my prompt disappears… 1000+ message session disappeared too"). The "Reconnecting…" banner with a counter the user screenshotted was the observable symptom of the data being wiped — each cycle of the reconnect loop ran the data-loss code path. Three defensive layers + a startup self-heal: (1) Session.save() raises RuntimeError if _loaded_metadata_only=True — loud crash beats silent wipe; Session.load_metadata_only() sets the flag on the returned stub. (2) _clear_stale_stream_state() detects the metadata-only stub and reloads with metadata_only=False before mutating; if the reload fails, bails without clearing rather than wipe (correct asymmetry: better stale flag than wiped data). (3) Asymmetric backup — Session.save() writes <sid>.json.bak IFF the previous on-disk message count is greater than the incoming one (zero overhead on grow path; snapshot on any shrink). (4) Startup self-heal in new api/session_recovery.py module — on server start, scans session JSONs whose count is less than their .bak count and restores from .bak. Idempotent on clean state. The first server start after deploying v0.50.284 will auto-restore any session that was wiped between deploys. 6 new regression tests in tests/test_metadata_save_wipe_1558.py covering all four layers + idempotence. Pre-release independent reviewer (nesquena) APPROVED with one MUST-FIX (issue-number references #1557#1558) which was absorbed. Pre-release Opus advisor SHIP AS-IS with two SHOULD-FIX items absorbed in-release: (a) patch the caller's in-memory stub fields after a successful clear so /api/session doesn't briefly return stale active_stream_id, avoiding one ghost SSE reconnect; (b) atomic .bak write via tmp + os.replace() pattern matching the main file write — prevents a torn .bak from a crash mid-write.

  • Race fix: stale stream cleanup mutates outside the per-session lock (#1557, @dutchaiagency; closes #1533) — Opus advisor follow-up from v0.50.279. _clear_stale_stream_state() held STREAMS_LOCK only across the registry lookup; the write to session.active_stream_id = None happened after release. A concurrent _handle_chat_start on the same session could race: the reader thread could clobber a freshly-registered stream's session.active_stream_id, orphaning the new stream and forcing one user retry. Fix: wrap the mutate-and-save block in _get_session_agent_lock(session.session_id) and re-read active_stream_id inside the lock, bailing if it changed. New deterministic two-thread regression test test_stale_stream_cleanup_does_not_clobber_concurrent_chat_start. Effect was bounded (one user retry per race window, no data corruption), but the lock is the right shape and the contributor included an actual race test instead of asserting source shape.

Affected versions

  • v0.50.279 — first vulnerable to the P0 data-loss path
  • v0.50.280, v0.50.281, v0.50.282, v0.50.283 — also vulnerable
  • v0.50.284 — this release; fixes the data-loss path, ships startup self-heal so users wiped between deploys get auto-recovery on next launch, and closes the related stale-stream race

Maintainer in-stage fixes (test isolation)

  • tests/test_sprint29.py::test_valid_skill_accepted — now cleans up the test-security-skill it creates. Previously leaked into the test SKILLS_DIR and shifted what tests/test_sprint3.py::test_skills_* saw.
  • tests/test_sprint3.py::test_skills_content_known — picks the first skill from /api/skills rather than hardcoding dogfood, with pytest.skip on empty list (signal that a sibling test repointed the SKILLS_DIR).
  • tests/test_sprint3.py::test_skills_search_returns_subset — relax > 5 threshold to > 0, same skip-on-empty escape. Functional contract under test: API returns non-empty when there are skills to return.

Tests

4019 → 4026 passing (+7 net: +6 from #1559 P0 hotfix tests, +1 from #1557 race regression). 0 regressions. Full suite in 109s.

Pre-release verification

  • Stage merge: clean apart from the expected api/routes.py conflict (combined Layer 2 metadata-only reload + #1557 lock; resolved with metadata-only check FIRST so a stub never even acquires the agent lock).
  • Browser sanity (HTTP API checks against port 8789): 11 endpoints verified.
  • Pre-release Opus advisor: SHIP AS-IS — all 5 verification questions cleared (conflict-resolution order, deadlock risk none, Layer 3 backup interaction, startup self-heal vs concurrent saves, test-isolation fix correctness). Two SHOULD-FIX items absorbed in-release.

Don't miss a new hermes-webui release

NewReleases is sending notifications on new releases.