v1.0.128
Closes two community-reported lifecycle bugs from @ishabana — both observed on Windows but cross-OS. The fixes compose: #559 prevents the leak that produces #560's contention, so they ship together as a single release.
What broke
#559 — /ctx-upgrade leaves zombie MCP child processes
@ishabana on Windows 11 reported that every /ctx-upgrade invocation accumulates a new MCP server process while the old one keeps running. After 5 instances (2 from upgrade leaks + 3 from a separate user misconfiguration: registering as both plugin AND npx), they hit #560.
Root cause verified: src/cli.ts:746 upgrade() runs npm install -g context-mode@latest but never signals sibling MCP processes. src/lifecycle.ts:126 shutdown() only fires on parent death, SIGTERM, or stdin EOF — none of which npm install -g triggers. This is a cross-OS bug, not Windows-specific. Empirical verification on a macOS machine during diagnosis: 5 PIDs across 3 versions (1.0.113, 1.0.117, 1.0.126) running concurrently, the oldest alive for 8h22m.
Fix architecture (Slice 1-4):
- New helper `src/util/sibling-mcp.ts` — cross-OS sibling MCP discovery (`pgrep -f` on POSIX, PowerShell `Get-CimInstance Win32_Process` on Windows — NOT deprecated `wmic`). Discovery regex catches both `~/.claude/plugins/cache/context-mode/context-mode/*/start.mjs` AND `~/.claude/plugins/marketplaces/context-mode/start.mjs` shapes. Skips `process.pid` AND `process.ppid` (parent claude process). A sketch of such a helper follows this list.
- SIGTERM → 1500ms wait → SIGKILL escalation — graceful first, force second. Inlined in the helper using `process.kill(pid, sig)` rather than reusing `executor.ts:95 killProcessTree` (that helper requires a ChildProcess ref; sibling discovery yields bare PIDs).
- Wired into `cli.ts upgrade()` at line ~880 — between version-compare and `npm install`. Earliest possible signal, so npm install of the new version doesn't race the old process. If discovery throws (weird Windows envs), it catches and continues — never blocks the upgrade.
- Human-readable summary log: "Stopped N sibling MCP servers" (suppressed when count = 0 to avoid noise).
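A minimal sketch of what that helper can look like, assuming the illustrative names findSiblingMcpPids / stopSiblingMcps and a simplified match pattern; the shipped src/util/sibling-mcp.ts uses the fuller path regex described above and may differ in other details:

```typescript
import { execFileSync } from 'node:child_process';

// Hypothetical names and pattern; the shipped src/util/sibling-mcp.ts may differ.
const PATTERN = 'context-mode.*start\\.mjs';

// Discover sibling MCP server PIDs, excluding this process and its parent.
export function findSiblingMcpPids(): number[] {
  let out = '';
  try {
    if (process.platform === 'win32') {
      // Get-CimInstance Win32_Process rather than the deprecated wmic.
      out = execFileSync('powershell', ['-NoProfile', '-Command',
        `Get-CimInstance Win32_Process | Where-Object { $_.CommandLine -match '${PATTERN}' } | Select-Object -ExpandProperty ProcessId`,
      ], { encoding: 'utf8' });
    } else {
      // pgrep -f matches against the full command line and prints one PID per line.
      out = execFileSync('pgrep', ['-f', PATTERN], { encoding: 'utf8' });
    }
  } catch {
    return []; // no matches (pgrep exits 1) or discovery failed: never block the upgrade
  }
  return out.trim().split(/\s+/).filter(Boolean).map(Number)
    .filter((pid) => Number.isInteger(pid) && pid > 0
      && pid !== process.pid && pid !== process.ppid);
}

// Graceful first, force second: SIGTERM, wait, then SIGKILL anything still around.
// (On Windows, Node maps both signals to an unconditional terminate.)
export async function stopSiblingMcps(graceMs = 1500): Promise<number> {
  const pids = findSiblingMcpPids();
  for (const pid of pids) {
    try { process.kill(pid, 'SIGTERM'); } catch { /* ESRCH: already gone */ }
  }
  if (pids.length > 0) await new Promise((resolve) => setTimeout(resolve, graceMs));
  for (const pid of pids) {
    try { process.kill(pid, 'SIGKILL'); } catch { /* exited during the grace period */ }
  }
  return pids.length;
}
```

In upgrade(), a call like stopSiblingMcps() would sit between the version compare and npm install, with the "Stopped N sibling MCP servers" line logged only when the returned count is non-zero.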
#560 — Multiple server instances cause unbounded WAL growth + query hangs
Same reporter, same env. With multiple servers on the same SQLite DB, WAL auto-checkpoint never gets a writer-free window — every write appends indefinitely. They observed a 238MB WAL and every ctx_search hanging.
Root cause verified: src/db-base.ts:325 applyWALPragmas set only journal_mode=WAL, synchronous=NORMAL, mmap_size=256MB. NO locking_mode, NO startup mutex, NO PID lockfile. Multi-instance collision was completely unguarded.
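For reference, a sketch of what that pre-fix pragma block boils down to, assuming an exec-style driver API (the actual applyWALPragmas in db-base.ts may be structured differently):

```typescript
// Pre-fix applyWALPragmas, reconstructed from the description above (illustrative only).
function applyWALPragmas(db: { exec(sql: string): void }): void {
  db.exec('PRAGMA journal_mode = WAL;');
  db.exec('PRAGMA synchronous = NORMAL;');
  db.exec('PRAGMA mmap_size = 268435456;'); // 256MB
  // Nothing here bounds concurrent writers: no locking_mode, no startup mutex,
  // no PID lockfile, so N processes can all append to the same WAL indefinitely.
}
```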
Fix architecture (Slice 5-7) — belt-and-braces, two layers of defense:
- Layer 1: PID lockfile (`src/util/db-lock.ts`) — `<dbPath>.lock` written via `fs.writeFileSync(path, String(process.pid), { flag: 'wx' })` for atomic create. On `EEXIST`: read the existing PID, check liveness (`isProcessAlive` pattern copied inline from `store.ts:187` — deliberate, to avoid `db-base → store` coupling per architect's design choice). If alive, throws @ishabana's verbatim error: `"Another context-mode server is already running (PID: XXXX). Stop it before starting a new instance."` If dead, atomic claim with a race-resolution re-read. A sketch of this layer follows this list.
- Layer 2: SQLite `locking_mode = EXCLUSIVE` — added to `applyWALPragmas`. Even if the lockfile is bypassed (filesystem race, stale FS state), SQLite itself blocks the second writer.
- Critical guardrail: both Layer 1 + Layer 2 SKIP when `dbPath.startsWith(tmpdir())`. The `defaultDBPath` for tests creates PID-scoped tmp DBs — without the skip-gate, every spawned test subprocess would deadlock.
- Wired into the `SQLiteBase` ctor (`db-base.ts:469-520`) — a single change covers both the `SessionDB` and `ContentStore` subclasses (when ContentStore migrates to SQLiteBase; out of scope for this release).
- Cleanup composes: lifecycle.ts `shutdown()` → `closeDB()` → `releaseDbLock()`. The `_liveDBs` global Set was upgraded to a Map (with the versioned symbol `__context_mode_live_dbs_v2__` to prevent a stale-module crash) so the exit hook can iterate `[dbPath, instance]` pairs and release lockfiles deterministically.
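A minimal sketch of the lockfile layer under the constraints above. releaseDbLock and the error text are taken from the description; acquireDbLock and the exact race-resolution details are illustrative assumptions, so the shipped src/util/db-lock.ts may differ:

```typescript
import fs from 'node:fs';
import { tmpdir } from 'node:os';

// Signal 0 checks process existence without sending anything
// (same idea as the isProcessAlive pattern referenced above).
function isProcessAlive(pid: number): boolean {
  try { process.kill(pid, 0); return true; } catch { return false; }
}

// Layer 1: claim <dbPath>.lock atomically; skip entirely for tmpdir-scoped test DBs.
export function acquireDbLock(dbPath: string): void {
  if (dbPath.startsWith(tmpdir())) return; // PID-scoped test DBs must not deadlock
  const lockPath = `${dbPath}.lock`;
  try {
    fs.writeFileSync(lockPath, String(process.pid), { flag: 'wx' }); // 'wx' => O_CREAT|O_EXCL
    return;
  } catch (err: any) {
    if (err?.code !== 'EEXIST') throw err;
  }
  const existingPid = Number(fs.readFileSync(lockPath, 'utf8').trim());
  if (Number.isInteger(existingPid) && isProcessAlive(existingPid)) {
    throw new Error(
      `Another context-mode server is already running (PID: ${existingPid}). ` +
      'Stop it before starting a new instance.'
    );
  }
  // Stale lockfile from a dead process: claim it, then re-read to resolve a write race.
  fs.writeFileSync(lockPath, String(process.pid));
  if (fs.readFileSync(lockPath, 'utf8').trim() !== String(process.pid)) {
    throw new Error('Lost the lockfile race to another starting instance.');
  }
}

// Idempotent release: only delete the lockfile if this process still owns it.
export function releaseDbLock(dbPath: string): void {
  if (dbPath.startsWith(tmpdir())) return;
  const lockPath = `${dbPath}.lock`;
  try {
    if (fs.readFileSync(lockPath, 'utf8').trim() === String(process.pid)) {
      fs.unlinkSync(lockPath);
    }
  } catch { /* lockfile already gone */ }
}
```

The 'wx' flag is what makes the first claim atomic: it maps to O_CREAT|O_EXCL, so two instances starting at the same instant cannot both create the file.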
Why these compose
#559's SIGTERM gives the old process time to run lifecycle.ts shutdown() → closeDB() → releaseDbLock() cleanly. If the graceful timeout elapses and SIGKILL escalates, #560's stale-PID detection self-heals on the next start. Splitting the fixes across releases would create regression windows where each would partially fail without the other.
Tests
25 new tests across 3 EXISTING test files (CONTRIBUTING L275 — zero new test files):
- `tests/executor.test.ts`: +7 tests (sibling discovery POSIX/Win parsing, kill escalation, ESRCH swallow)
- `tests/cli/upgrade-verifies-binding.test.ts`: +3 tests (kill called before npm install, kill summary suppression, discovery error never blocks upgrade)
- `tests/util/db-base-platform-gate.test.ts`: +15 tests (O_EXCL atomicity, live-PID rejection, stale-lockfile claim, race resolution, tmpdir skip, idempotent release, EXCLUSIVE pragma, runtime double-open rejection, lifecycle composition: A holds → B blocked → A.close → B succeeds — see the sketch after this list)
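As an illustration of the lifecycle-composition case in the last group, here is a single-process sketch of that scenario using node:test; the repo's actual runner, import paths, and the acquireDbLock name are assumptions, and the real tests may exercise separate processes instead:

```typescript
import { test } from 'node:test';
import assert from 'node:assert/strict';
import { mkdirSync, rmSync } from 'node:fs';
import { join } from 'node:path';
// Hypothetical import path; the shipped helper lives in src/util/db-lock.ts.
import { acquireDbLock, releaseDbLock } from '../src/util/db-lock';

test('lifecycle composition: A holds -> B blocked -> A.close -> B succeeds', () => {
  // Use a path under cwd, not tmpdir(), so the skip-gate does not bypass locking.
  const dir = join(process.cwd(), '.lock-composition-test');
  mkdirSync(dir, { recursive: true });
  const dbPath = join(dir, 'sessions.sqlite');
  try {
    acquireDbLock(dbPath);                                         // instance A claims the lock
    assert.throws(() => acquireDbLock(dbPath), /already running/); // instance B is rejected
    releaseDbLock(dbPath);                                         // A shuts down cleanly
    assert.doesNotThrow(() => acquireDbLock(dbPath));              // B can now start
  } finally {
    releaseDbLock(dbPath);
    rmSync(dir, { recursive: true, force: true });
  }
});
```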
Per Mert's "only run new tests" directive — 25/25 PASS, with the full suite skipped to save time. npm run typecheck PASS after every slice.
Compatibility
15 adapters / 3 OSes. Both fixes are core lifecycle (NOT adapter-specific) — a future adapter #16 inherits both protections automatically. No interface changes. The new helpers (sibling-mcp.ts, db-lock.ts) are universal — they operate on the canonical plugin path layout, not per-adapter logic.
| Adapter | Effect of v1.0.128 |
|---|---|
| All 15 (universal) | /ctx-upgrade now kills sibling MCP processes before npm install. Multi-instance DB access blocked at lockfile + SQLite-EXCLUSIVE level. |
Upgrade
```
npm install -g context-mode@latest
# inside Claude Code:
/ctx-upgrade
# observe stderr: "Stopped N sibling MCP servers" if leaks existed
# fully restart Claude Code (Cmd+Q + reopen — /reload-plugins doesn't cycle MCP children)
```

If you've been hit by accumulated zombie processes from prior versions, this is the first upgrade that actively cleans them up. Verify with `pgrep -f context-mode.*start.mjs` (POSIX) or `Get-Process node | Where-Object CommandLine -Match context-mode` (Windows) — you should see 1 process per active Claude session, not N processes per upgrade.
Thanks
@ishabana for two staff-grade reports — a clean reproducer for #559 (`ps aux | grep context-mode`), a precise impact analysis for #560 (238MB WAL + the 5-instance cause-effect chain), AND two concrete fix proposals for #560 (Option A locking_mode + Option B PID lockfile). We shipped both as belt-and-braces. The 5-instance accumulation pattern made this architectural failure class visible — without your data, it would have stayed silent until the next user hit it harder.