Cancel cluster-crash fix
Fixes a server crash in worker cancellation.
pg_background_cancel(pid, cookie, grace_ms) (and the batch / by-label variants) previously escalated to kill(pid, SIGKILL) when a worker had not stopped within the grace window. PostgreSQL's postmaster treats any child that dies from an uncaught signal as a crash and responds with a cluster-wide restart (terminating every backend and reinitializing shared memory), so a single cancel could take down the whole server.
This was benign on fast machines — the worker almost always exited cooperatively within grace_ms, so the SIGKILL never fired — but caused a deterministic crash on slow or loaded hosts, observed on the Debian/pgdg PostgreSQL 18 build farm and reproduced locally under Valgrind, where a worker still in PostgreSQL's bgworker startup path when the grace timer expired was killed before it ever began executing. Because SIGKILL is uncatchable the worker logged nothing, which had previously been misdiagnosed as an upstream PG SIGSEGV.
What changed
- Removed the
SIGKILLescalation. Cancellation is now cooperative only:SIGTERMplus, forgrace_ms > 0, a bounded wait so the caller can observe whether the worker stopped. This matches the documented contract — cancel requests termination; it does not guarantee an immediate stop. - An unresponsive, CPU-bound worker that never reaches an interrupt check is not force-stopped; bound such work with
statement_timeoutor thepg_background.worker_timeoutGUC. - Docs corrected (README "Known Limitations §10", ARCHITECTURE.md) and the cancel SQL
COMMENT/ migration notes updated.
Upgrade
No SQL or API changes — the extension version stays 2.0. Upgrade by rebuilding/replacing pg_background.so (make install and restart sessions/workers); no ALTER EXTENSION step is required.
Validation
Standard installcheck passes on PG 18 with no expected-output change; the Valgrind run that previously crashed at the cancel test now completes cleanly (no abnormal worker exits — cancelled workers exit with code 1, never by signal).