v1.0.128
Closes two community-reported lifecycle bugs from @ishabana — both observed on Windows but cross-OS. The fixes compose: #559 prevents the leak that produces #560's contention, so they ship together as a single release.
What broke
#559 — /ctx-upgrade leaves zombie MCP child processes
@ishabana on Windows 11 reported that every /ctx-upgrade invocation accumulates a new MCP server process while the old one keeps running. After 5 instances (2 from upgrade leaks + 3 from a separate user misconfiguration: registering as both plugin AND npx), they hit #560.
Root cause verified: src/cli.ts:746 upgrade() runs npm install -g context-mode@latest but never signals sibling MCP processes. src/lifecycle.ts:126 shutdown() only fires on parent death, SIGTERM, or stdin EOF — none of which npm install -g triggers. This is a cross-OS bug, not Windows-specific. Empirical verification on a macOS machine during diagnosis: 5 PIDs across 3 versions (1.0.113, 1.0.117, 1.0.126) running concurrently, the oldest alive for 8h22m.
Fix architecture (Slice 1-4):
- New helper `src/util/sibling-mcp.ts` — cross-OS sibling MCP discovery (`pgrep -f` on POSIX, PowerShell `Get-CimInstance Win32_Process` on Windows — NOT deprecated `wmic`). Discovery regex catches both `~/.claude/plugins/cache/context-mode/context-mode/*/start.mjs` AND `~/.claude/plugins/marketplaces/context-mode/start.mjs` shapes. Skips `process.pid` AND `process.ppid` (parent claude process). A sketch of such a helper follows this list.
- SIGTERM → 1500ms wait → SIGKILL escalation — graceful first, force second. Inlined in the helper using `process.kill(pid, sig)` rather than reusing `executor.ts:95 killProcessTree` (that helper requires a ChildProcess ref; sibling discovery yields bare PIDs).
- Wired into `cli.ts upgrade()` at line ~880 — between version-compare and `npm install`. Earliest possible signal, so npm install of the new version doesn't race the old process. If discovery throws (weird Windows envs), it catches and continues — never blocks the upgrade.
- Human-readable summary log: "Stopped N sibling MCP servers" (suppressed when count = 0 to avoid noise).
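A minimal sketch of what that helper can look like, assuming the illustrative names findSiblingMcpPids / stopSiblingMcps and a simplified match pattern; the shipped src/util/sibling-mcp.ts uses the fuller path regex described above and may differ in other details:

```typescript
import { execFileSync } from 'node:child_process';

// Hypothetical names and pattern; the shipped src/util/sibling-mcp.ts may differ.
const PATTERN = 'context-mode.*start\\.mjs';

// Discover sibling MCP server PIDs, excluding this process and its parent.
export function findSiblingMcpPids(): number[] {
  let out = '';
  try {
    if (process.platform === 'win32') {
      // Get-CimInstance Win32_Process rather than the deprecated wmic.
      out = execFileSync('powershell', ['-NoProfile', '-Command',
        `Get-CimInstance Win32_Process | Where-Object { $_.CommandLine -match '${PATTERN}' } | Select-Object -ExpandProperty ProcessId`,
      ], { encoding: 'utf8' });
    } else {
      // pgrep -f matches against the full command line and prints one PID per line.
      out = execFileSync('pgrep', ['-f', PATTERN], { encoding: 'utf8' });
    }
  } catch {
    return []; // no matches (pgrep exits 1) or discovery failed: never block the upgrade
  }
  return out.trim().split(/\s+/).filter(Boolean).map(Number)
    .filter((pid) => Number.isInteger(pid) && pid > 0
      && pid !== process.pid && pid !== process.ppid);
}

// Graceful first, force second: SIGTERM, wait, then SIGKILL anything still around.
// (On Windows, Node maps both signals to an unconditional terminate.)
export async function stopSiblingMcps(graceMs = 1500): Promise<number> {
  const pids = findSiblingMcpPids();
  for (const pid of pids) {
    try { process.kill(pid, 'SIGTERM'); } catch { /* ESRCH: already gone */ }
  }
  if (pids.length > 0) await new Promise((resolve) => setTimeout(resolve, graceMs));
  for (const pid of pids) {
    try { process.kill(pid, 'SIGKILL'); } catch { /* exited during the grace period */ }
  }
  return pids.length;
}
```

In upgrade(), a call like stopSiblingMcps() would sit between the version compare and npm install, with the "Stopped N sibling MCP servers" line logged only when the returned count is non-zero.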
#560 — Multiple server instances cause unbounded WAL growth + query hangs
Same reporter, same env. With multiple servers on the same SQLite DB, WAL auto-checkpoint never gets a writer-free window — every write appends indefinitely. They observed a 238MB WAL and every ctx_search hanging.
Root cause verified: src/db-base.ts:325 applyWALPragmas set only journal_mode=WAL, synchronous=NORMAL, mmap_size=256MB. NO locking_mode, NO startup mutex, NO PID lockfile. Multi-instance collision was completely unguarded.
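For reference, a sketch of what that pre-fix pragma block boils down to, assuming an exec-style driver API (the actual applyWALPragmas in db-base.ts may be structured differently):

```typescript
// Pre-fix applyWALPragmas, reconstructed from the description above (illustrative only).
function applyWALPragmas(db: { exec(sql: string): void }): void {
  db.exec('PRAGMA journal_mode = WAL;');
  db.exec('PRAGMA synchronous = NORMAL;');
  db.exec('PRAGMA mmap_size = 268435456;'); // 256MB
  // Nothing here bounds concurrent writers: no locking_mode, no startup mutex,
  // no PID lockfile, so N processes can all append to the same WAL indefinitely.
}
```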
Fix architecture (Slice 5-7) — belt-and-braces, two layers of defense:
- Layer 1: PID lockfile (`src/util/db-lock.ts`) — `<dbPath>.lock` written via `fs.writeFileSync(path, String(process.pid), { flag: 'wx' })` for atomic create. On `EEXIST`: read the existing PID, check liveness (`isProcessAlive` pattern copied inline from `store.ts:187` — deliberate, to avoid `db-base → store` coupling per architect's design choice). If alive, throws @ishabana's verbatim error: `"Another context-mode server is already running (PID: XXXX). Stop it before starting a new instance."` If dead, atomic claim with a race-resolution re-read. A sketch of this layer follows this list.
- Layer 2: SQLite `locking_mode = EXCLUSIVE` — added to `applyWALPragmas`. Even if the lockfile is bypassed (filesystem race, stale FS state), SQLite itself blocks the second writer.
- Critical guardrail: both Layer 1 + Layer 2 SKIP when `dbPath.startsWith(tmpdir())`. The `defaultDBPath` for tests creates PID-scoped tmp DBs — without the skip-gate, every spawned test subprocess would deadlock.
- Wired into the `SQLiteBase` ctor (`db-base.ts:469-520`) — a single change covers both the `SessionDB` and `ContentStore` subclasses (when ContentStore migrates to SQLiteBase; out of scope for this release).
- Cleanup composes: lifecycle.ts `shutdown()` → `closeDB()` → `releaseDbLock()`. The `_liveDBs` global Set was upgraded to a Map (with the versioned symbol `__context_mode_live_dbs_v2__` to prevent a stale-module crash) so the exit hook can iterate `[dbPath, instance]` pairs and release lockfiles deterministically.
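A minimal sketch of the lockfile layer under the constraints above. releaseDbLock and the error text are taken from the description; acquireDbLock and the exact race-resolution details are illustrative assumptions, so the shipped src/util/db-lock.ts may differ:

```typescript
import fs from 'node:fs';
import { tmpdir } from 'node:os';

// Signal 0 checks process existence without sending anything
// (same idea as the isProcessAlive pattern referenced above).
function isProcessAlive(pid: number): boolean {
  try { process.kill(pid, 0); return true; } catch { return false; }
}

// Layer 1: claim <dbPath>.lock atomically; skip entirely for tmpdir-scoped test DBs.
export function acquireDbLock(dbPath: string): void {
  if (dbPath.startsWith(tmpdir())) return; // PID-scoped test DBs must not deadlock
  const lockPath = `${dbPath}.lock`;
  try {
    fs.writeFileSync(lockPath, String(process.pid), { flag: 'wx' }); // 'wx' => O_CREAT|O_EXCL
    return;
  } catch (err: any) {
    if (err?.code !== 'EEXIST') throw err;
  }
  const existingPid = Number(fs.readFileSync(lockPath, 'utf8').trim());
  if (Number.isInteger(existingPid) && isProcessAlive(existingPid)) {
    throw new Error(
      `Another context-mode server is already running (PID: ${existingPid}). ` +
      'Stop it before starting a new instance.'
    );
  }
  // Stale lockfile from a dead process: claim it, then re-read to resolve a write race.
  fs.writeFileSync(lockPath, String(process.pid));
  if (fs.readFileSync(lockPath, 'utf8').trim() !== String(process.pid)) {
    throw new Error('Lost the lockfile race to another starting instance.');
  }
}

// Idempotent release: only delete the lockfile if this process still owns it.
export function releaseDbLock(dbPath: string): void {
  if (dbPath.startsWith(tmpdir())) return;
  const lockPath = `${dbPath}.lock`;
  try {
    if (fs.readFileSync(lockPath, 'utf8').trim() === String(process.pid)) {
      fs.unlinkSync(lockPath);
    }
  } catch { /* lockfile already gone */ }
}
```

The 'wx' flag is what makes the first claim atomic: it maps to O_CREAT|O_EXCL, so two instances starting at the same instant cannot both create the file.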
Why these compose
#559's SIGTERM gives the old process time to run lifecycle.ts shutdown() → closeDB() → releaseDbLock() cleanly. If the graceful timeout elapses and SIGKILL escalates, #560's stale-PID detection self-heals on the next start. Splitting the fixes across releases would create regression windows where each would partially fail without the other.
Tests
25 new tests across 3 EXISTING test files (CONTRIBUTING L275 — zero new test files):
- `tests/executor.test.ts`: +7 tests (sibling discovery POSIX/Win parsing, kill escalation, ESRCH swallow)
- `tests/cli/upgrade-verifies-binding.test.ts`: +3 tests (kill called before npm install, kill summary suppression, discovery error never blocks upgrade)
- `tests/util/db-base-platform-gate.test.ts`: +15 tests (O_EXCL atomicity, live-PID rejection, stale-lockfile claim, race resolution, tmpdir skip, idempotent release, EXCLUSIVE pragma, runtime double-open rejection, lifecycle composition: A holds → B blocked → A.close → B succeeds — see the sketch after this list)
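As an illustration of the lifecycle-composition case in the last group, here is a single-process sketch of that scenario using node:test; the repo's actual runner, import paths, and the acquireDbLock name are assumptions, and the real tests may exercise separate processes instead:

```typescript
import { test } from 'node:test';
import assert from 'node:assert/strict';
import { mkdirSync, rmSync } from 'node:fs';
import { join } from 'node:path';
// Hypothetical import path; the shipped helper lives in src/util/db-lock.ts.
import { acquireDbLock, releaseDbLock } from '../src/util/db-lock';

test('lifecycle composition: A holds -> B blocked -> A.close -> B succeeds', () => {
  // Use a path under cwd, not tmpdir(), so the skip-gate does not bypass locking.
  const dir = join(process.cwd(), '.lock-composition-test');
  mkdirSync(dir, { recursive: true });
  const dbPath = join(dir, 'sessions.sqlite');
  try {
    acquireDbLock(dbPath);                                         // instance A claims the lock
    assert.throws(() => acquireDbLock(dbPath), /already running/); // instance B is rejected
    releaseDbLock(dbPath);                                         // A shuts down cleanly
    assert.doesNotThrow(() => acquireDbLock(dbPath));              // B can now start
  } finally {
    releaseDbLock(dbPath);
    rmSync(dir, { recursive: true, force: true });
  }
});
```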
Per Mert's "only run new tests" directive — 25/25 PASS, with the full suite skipped to save time. npm run typecheck PASS after every slice.
Compatibility
15 adapters / 3 OSes. Both fixes are core lifecycle (NOT adapter-specific) — a future adapter #16 inherits both protections automatically. No interface changes. The new helpers (sibling-mcp.ts, db-lock.ts) are universal — they operate on the canonical plugin path layout, not per-adapter logic.
| Adapter | Effect of v1.0.128 |
|---|---|
| All 15 (universal) | /ctx-upgrade now kills sibling MCP processes before npm install. Multi-instance DB access blocked at lockfile + SQLite-EXCLUSIVE level. |
Upgrade
```
npm install -g context-mode@latest
# inside Claude Code:
/ctx-upgrade
# observe stderr: "Stopped N sibling MCP servers" if leaks existed
# fully restart Claude Code (Cmd+Q + reopen — /reload-plugins doesn't cycle MCP children)
```

If you've been hit by accumulated zombie processes from prior versions, this is the first upgrade that actively cleans them up. Verify with `pgrep -f context-mode.*start.mjs` (POSIX) or `Get-Process node | Where-Object CommandLine -Match context-mode` (Windows) — you should see 1 process per active Claude session, not N processes per upgrade.
Thanks
@ishabana for two staff-grade reports — a clean reproducer for #559 (`ps aux | grep context-mode`), a precise impact analysis for #560 (238MB WAL + the 5-instance cause-effect chain), AND two concrete fix proposals for #560 (Option A locking_mode + Option B PID lockfile). We shipped both as belt-and-braces. The 5-instance accumulation pattern made this architectural failure class visible — without your data, it would have stayed silent until the next user hit it harder.