github nesquena/hermes-webui v0.50.289
v0.50.289 — TCP keepalive on accepted connections

2 hours ago

1 PR by external contributor @happy5318. Closes #1580.

What's fixed

TCP keepalive on accepted connections to clean up dead CLOSE-WAIT sockets (#1581 by @happy5318; closes #1580)

Long-running Linux WebUI servers were accumulating CLOSE-WAIT zombie connections after clients crashed or lost their network without sending FIN. Without TCP keepalive enabled, threads blocked in recv() waiting for the next request had no way to detect the dead peer.

Fix: new Handler.setup() override in server.py that, on every accepted connection, sets:

  • SO_KEEPALIVE=1 (master switch — enables TCP keepalive on this socket)
  • TCP_NODELAY=1 (disables Nagle for HTTP small-burst latency)
  • TCP_KEEPIDLE=10 / TCP_KEEPINTVL=5 / TCP_KEEPCNT=3 (kernel starts probing a connection idle for 10s, probes every 5s, drops after 3 failed probes — ~25s detection)
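The options above can be sketched roughly as follows. This is a minimal illustration, not the code from #1581: the helper names `enable_keepalive` and `detection_seconds` are hypothetical, and basing `Handler` on `BaseHTTPRequestHandler` is an assumption about server.py.

```python
import socket
from http.server import BaseHTTPRequestHandler

# Timing values quoted in the release notes.
KEEPIDLE_S, KEEPINTVL_S, KEEPCNT = 10, 5, 3


def enable_keepalive(sock: socket.socket) -> None:
    """Apply the listed socket options to one accepted connection (hypothetical helper)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)   # master switch
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)   # disable Nagle
    try:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, KEEPIDLE_S)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, KEEPINTVL_S)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, KEEPCNT)
    except AttributeError:
        # A constant is missing on this platform; keep the OS default timings.
        pass


def detection_seconds(idle: int = KEEPIDLE_S, interval: int = KEEPINTVL_S,
                      count: int = KEEPCNT) -> int:
    """Idle period plus one probe interval per failed probe."""
    return idle + interval * count


class Handler(BaseHTTPRequestHandler):  # base class is an assumption
    def setup(self) -> None:
        super().setup()  # sets self.connection to the accepted socket
        try:
            enable_keepalive(self.connection)
        except OSError:
            pass  # socket tuning is best-effort; never fail a request over it
```

With the quoted values, `detection_seconds()` evaluates to 10 + 5 × 3 = 25, which is where the ~25s figure comes from.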

Healthy connections are unaffected. The existing 30s app-level : keepalive\n\n SSE heartbeat is longer than the 10s idle threshold, so the kernel may probe a quiet SSE stream between heartbeats, but a live peer acknowledges each probe and the connection stays open. Only sockets whose peer has genuinely gone away fail three probes in a row and get cleaned up.

Cross-platform: degrades gracefully on macOS/Windows, where a missing TCP_KEEP* constant raises AttributeError and the keepalive tuning is skipped. Linux production target gets the full benefit. (See #1583 for follow-up to extend macOS coverage.)

Tests

4094 → 4094 passing (no new tests). A kernel-level networking change is impractical to test reliably in the unit suite without a multi-process integration fixture.

Pre-release verification

  • Independent reviewer (nesquena) APPROVED end-to-end.
  • Pre-release Opus advisor: SHIP AS-IS — no MUST-FIX. All verification questions cleared.
  • Full test suite: 4094 passed, 0 regressions.
  • Live verification post-deploy: ss -tnoe on production server shows timer:(keepalive,...) on accepted sockets, confirming SO_KEEPALIVE=1 is active on the server-side connection.
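For a socket you hold a reference to in-process, the same information that `ss` reports can be read back with `getsockopt`. The sketch below is only illustrative; `keepalive_settings` is a hypothetical helper and is not part of hermes-webui.

```python
import socket


def keepalive_settings(sock: socket.socket) -> dict:
    """Return the keepalive-related options currently set on sock."""
    settings = {
        "SO_KEEPALIVE": bool(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)),
    }
    # The timing constants are platform-dependent, so look them up defensively.
    for name in ("TCP_KEEPIDLE", "TCP_KEEPINTVL", "TCP_KEEPCNT"):
        opt = getattr(socket, name, None)
        if opt is not None:
            settings[name] = sock.getsockopt(socket.IPPROTO_TCP, opt)
    return settings


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print("before:", keepalive_settings(s))  # OS defaults
        s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        print("after: ", keepalive_settings(s))
```

Calling this on `self.connection` inside a request handler would show whether the options took effect for that connection.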

Maintainer in-stage actions

  • PR rebase (REBASE-DEFAULT rule): PR base was 111 commits behind origin/master (forked at 6c3ff3ff, pre-v0.50.275). Rebased onto current master (v0.50.288). Clean, no conflicts.

Known follow-ups (filed as #1583)

  1. QuietHTTPServer.server_bind() block contains harmless dead code (TCP_KEEP* without SO_KEEPALIVE on listening socket = no-op; redundant SO_REUSEADDR already set by parent class).
  2. macOS gets TCP_NODELAY only — TCP_KEEPIDLE AttributeError aborts the entire try block before SO_KEEPALIVE=1 is reached. Linux production target unaffected.

Both deferred to a small cleanup PR.
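One possible shape for the macOS part of that cleanup, sketched under assumptions: enable SO_KEEPALIVE before touching any platform-specific constant, then look up each timing option individually so that one missing name cannot abort the rest. `apply_keepalive` is a hypothetical name, and falling back to `socket.TCP_KEEPALIVE` (the macOS counterpart of TCP_KEEPIDLE, exposed by Python 3.10+) is a suggestion rather than something #1583 specifies.

```python
import socket


def apply_keepalive(sock: socket.socket, idle: int = 10,
                    interval: int = 5, count: int = 3) -> list:
    """Enable keepalive first, then tune whichever timing options exist.

    Returns the names of the options that were actually applied.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = ["SO_KEEPALIVE"]

    # Linux names the idle option TCP_KEEPIDLE; macOS calls it TCP_KEEPALIVE.
    idle_name = "TCP_KEEPIDLE" if hasattr(socket, "TCP_KEEPIDLE") else "TCP_KEEPALIVE"
    options = ((idle_name, idle), ("TCP_KEEPINTVL", interval), ("TCP_KEEPCNT", count))

    for name, value in options:
        opt = getattr(socket, name, None)
        if opt is None:
            continue  # constant not available on this platform
        try:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
        except OSError:
            continue  # option known to Python but rejected by the OS
        applied.append(name)
    return applied
```

Returning the applied names makes it easy to log, per platform, which options were in effect.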


Thanks @happy5318 for the diagnosis and fix!

Full changelog: https://github.com/nesquena/hermes-webui/blob/master/CHANGELOG.md
