Fix: Prevent chroma-mcp spawn storm (PR #1065)
Fixes a critical bug where killing the worker daemon during active sessions caused 641 chroma-mcp Python processes to spawn in ~5 minutes, consuming 75%+ CPU and ~64GB virtual memory.
Root Cause
ChromaSync.ensureConnection() had no connection mutex. Concurrent fire-and-forget syncObservation() calls from multiple sessions raced through the check-then-act guard, each spawning a chroma-mcp subprocess via StdioClientTransport. Error-driven reconnection created a positive feedback loop.
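The race and its fix can be illustrated with a minimal sketch of promise memoization. This is not the actual ChromaSync code: `ConnectionManager`, `doConnect`, and the injected `spawn` callback are hypothetical names used only to show the pattern, with `spawn` standing in for the chroma-mcp subprocess launch.

```typescript
// Hypothetical stand-in for ChromaSync; only the memoization pattern matters here.
class ConnectionManager {
  private connected = false;
  // In-flight connection attempt, shared by all concurrent callers.
  private connecting: Promise<void> | null = null;

  // `spawn` is an assumed placeholder for launching the subprocess.
  constructor(private readonly spawn: () => Promise<void>) {}

  ensureConnection(): Promise<void> {
    if (this.connected) return Promise.resolve();
    // A second caller arriving mid-connect reuses the pending promise
    // instead of passing the check and spawning again.
    if (this.connecting) return this.connecting;

    this.connecting = this.doConnect().finally(() => {
      // Clear the memo so a later call can retry after a failure.
      this.connecting = null;
    });
    return this.connecting;
  }

  private async doConnect(): Promise<void> {
    await this.spawn();
    this.connected = true;
  }
}
```

Without the `connecting` field, every caller that arrives before `connected` flips to `true` would start its own spawn, which is the check-then-act race described above.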
5-Layer Defense
| Layer | Mechanism | Purpose |
|---|---|---|
| 0 | Connection mutex via promise memoization | Coalesces concurrent callers onto a single spawn attempt |
| 1 | Pre-spawn process count guard (execFileSync('ps')) | Kills excess chroma-mcp processes before spawning new ones |
| 2 | Hardened close() with try-finally + Unix pkill -P fallback | Guarantees state reset even on error, kills orphaned children |
| 3 | Count-based orphan reaper in ProcessManager | Kills by count (not age), catches spawn storms where all processes are young |
| 4 | Circuit breaker (3 failures → 60s cooldown) | Stops the error-driven reconnection feedback loop |
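As a rough illustration of layer 4, a circuit breaker with the thresholds from the table (3 failures, 60s cooldown) might look like the sketch below. The class name, method names, and the injectable `now` clock are assumptions made for this example, not the identifiers used in the PR.

```typescript
// Hypothetical circuit breaker; thresholds match the table, names are illustrative.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(
    private readonly maxFailures: number = 3,
    private readonly cooldownMs: number = 60000,
    // Injectable clock (an assumption of this sketch) so tests can control time.
    private readonly now: () => number = () => Date.now(),
  ) {}

  // Returns false while the breaker is open and the cooldown has not elapsed.
  canAttempt(): boolean {
    if (this.openedAt === null) return true;
    if (this.now() - this.openedAt >= this.cooldownMs) {
      // Cooldown elapsed: close the breaker and start counting from zero.
      this.openedAt = null;
      this.failures = 0;
      return true;
    }
    return false;
  }

  recordFailure(): void {
    this.failures++;
    if (this.failures >= this.maxFailures) this.openedAt = this.now();
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }
}
```

A caller would check `canAttempt()` before trying to reconnect and skip the attempt while the breaker is open, so repeated errors cannot keep triggering new spawns.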
Additional Fix
- Process guards now use etime-based sorting instead of PID ordering for reliable age determination (PIDs wrap and don't guarantee ordering)
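To show why etime gives a reliable ordering, here is a small sketch that parses the `[[dd-]hh:]mm:ss` elapsed-time format printed by `ps -o etime` and sorts oldest first. `ProcInfo`, `parseEtimeSeconds`, and `oldestFirst` are hypothetical helpers written for this example; the PR's actual parsing code may differ.

```typescript
// Hypothetical process record; `etime` is the raw string from `ps -o etime`.
interface ProcInfo {
  pid: number;
  etime: string;
}

// Converts "[[dd-]hh:]mm:ss" into elapsed seconds.
function parseEtimeSeconds(etime: string): number {
  const match = /^(?:(\d+)-)?(?:(\d+):)?(\d+):(\d+)$/.exec(etime.trim());
  if (!match) throw new Error(`Unrecognized etime: ${etime}`);
  // Days and hours are optional in the format and default to zero.
  const [, days = "0", hours = "0", minutes, seconds] = match;
  return (
    ((Number(days) * 24 + Number(hours)) * 60 + Number(minutes)) * 60 +
    Number(seconds)
  );
}

// Returns a new array sorted by elapsed time, oldest process first.
function oldestFirst(procs: ProcInfo[]): ProcInfo[] {
  return [...procs].sort(
    (a, b) => parseEtimeSeconds(b.etime) - parseEtimeSeconds(a.etime),
  );
}
```

With this ordering, a process with a high PID that has been running for hours is still treated as older than a freshly spawned process with a low PID after wraparound.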
Testing
- 16 new tests for mutex, circuit breaker, close() hardening, and count guard
- All tests pass (947 pass, 3 skip)
Closes #1063, closes #695. Relates to #1010, #707.
Contributors: @rodboev