🔧 Fix
Detect PID reuse in the worker start-guard so containers can restart cleanly. (#2082)
The kill(pid, 0) liveness check produced a false positive when the worker's PID file outlived its PID namespace, most commonly after docker stop / docker start with a bind-mounted ~/.claude-mem. The new worker would boot with the same low PID (often 11) as the old one, kill(pid, 0) would report "alive," and the worker would refuse to start, mistaking itself for its own prior incarnation. Symptom: the container appeared to start, then immediately exited cleanly with no user-visible error, and the worker never came up.
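For readers unfamiliar with the pattern, a liveness-only check of the kind described above might look roughly like the sketch below. The function name isPidAlive is hypothetical and is not taken from the codebase; the point is only that signal 0 answers "does some process have this PID?", not "is it the process that wrote the PID file?".

```typescript
// Hypothetical liveness-only check (illustrative; not the project's actual code).
function isPidAlive(pid: number): boolean {
  try {
    // Signal 0 delivers nothing; it only checks that the PID exists
    // and that we are allowed to signal it.
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but belongs to another user, so it is alive.
    // ESRCH (or anything else): treat as not alive.
    return (err as { code?: string }).code === "EPERM";
  }
}

// After a container restart, a stale PID file may name a PID that the new
// worker itself now holds, so this check passes for the wrong process.
const pidFromStaleFile = process.pid; // stand-in for the reused PID
console.log(isPidAlive(pidFromStaleFile));
```

Because the check cannot tell two processes with the same PID apart, any guard built on it alone will refuse to start whenever the PID is reused.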
What changed
- Capture an opaque process-start identity token alongside the PID and verify identity, not just liveness:
  - Linux: /proc/<pid>/stat field 22 (starttime in jiffies). Cheap, no exec, and the same signal pgrep / systemd use.
  - macOS / POSIX: ps -p <pid> -o lstart= with LC_ALL=C pinned so the emitted timestamp is locale-independent across environments.
  - Windows: unchanged; falls back to liveness-only. The PID-reuse scenario doesn't affect Windows deployments the way containers do.
- verifyPidFileOwnership emits a DEBUG log when liveness passes but the token mismatches, so the "PID reused" case is distinguishable from "process dead" in production logs.
- PID files written by older versions are token-less; verifyPidFileOwnership falls back to the existing liveness-only behavior for backwards compatibility. No migration required.
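The capture-and-verify flow in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in src/supervisor/process-registry.ts: the names captureToken, verifyOwnership, parseStatStartTime, and isAlive are hypothetical stand-ins, the PidInfo field names are guesses, and treating a failed capture as "fall back to liveness" is an assumption modelled on the Windows behavior described above.

```typescript
import { readFileSync } from "node:fs";
import { execFileSync } from "node:child_process";

// Assumed shape; older PID files simply lack startToken.
interface PidInfo { pid: number; startToken?: string }

// Extract field 22 (starttime) from the contents of /proc/<pid>/stat.
function parseStatStartTime(stat: string): string | null {
  // Field 2 (comm) is parenthesised and may contain spaces or ')',
  // so split only what follows the last ')'.
  const close = stat.lastIndexOf(")");
  if (close === -1) return null;
  const rest = stat.slice(close + 1).trim().split(/\s+/);
  // rest[0] is field 3 (state), so field 22 is rest[19].
  return rest[19] ?? null;
}

function captureToken(pid: number): string | null {
  try {
    if (process.platform === "linux") {
      return parseStatStartTime(readFileSync(`/proc/${pid}/stat`, "utf8"));
    }
    if (process.platform === "win32") return null; // liveness-only on Windows
    // macOS / other POSIX: pin LC_ALL=C so the timestamp format is stable.
    const out = execFileSync("ps", ["-p", String(pid), "-o", "lstart="], {
      encoding: "utf8",
      env: { ...process.env, LC_ALL: "C" },
    }).trim();
    return out || null;
  } catch {
    return null; // process gone, /proc unreadable, or ps unavailable
  }
}

function isAlive(pid: number): boolean {
  try {
    process.kill(pid, 0); // signal 0: existence check only
    return true;
  } catch (err) {
    return (err as { code?: string }).code === "EPERM";
  }
}

function verifyOwnership(info: PidInfo): boolean {
  if (!isAlive(info.pid)) return false;  // process dead
  if (!info.startToken) return true;     // token-less legacy file: liveness only
  const current = captureToken(info.pid);
  if (current === null) return true;     // assumption: cannot capture, fall back to liveness
  if (current !== info.startToken) {
    console.debug(`PID ${info.pid} is alive but start token differs: PID reused`);
    return false;
  }
  return true;
}
```

In this sketch the token is compared only for equality and never parsed, which is what lets the Linux jiffies value and the ps timestamp string share one code path.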
Surface
Shared helpers (PidInfo, captureProcessStartToken, verifyPidFileOwnership) live in src/supervisor/process-registry.ts and are re-exported from ProcessManager.ts to preserve the existing public surface. Both entry points updated: worker-service.ts GUARD 1 and supervisor/index.ts validateWorkerPidFile.
Tests
+14 new tests covering token capture, ownership verification, backwards compatibility for tokenless PID files, and the container-restart regression scenario. Zero regressions.