Two related fixes for #442 — the polling-side wedge that's been the long-running symptom on this issue.
petermeter69's 29 May log paste flipped the diagnosis: every entry during his 15-minute wedge is a polling_error event (3-line pattern: leaf message, RequestError dump, "Polling error --> Trying again."). The log contains zero "Bot error:" lines, zero "will restart in Xms" lines, and zero "gave up restarting" lines. That means scheduleRestart, and with it the V17.4.5 per-bot agent-pool rebuild that was supposed to be the root-cause fix for this issue, was never called during the actual outage. The bot's 'error' event listener doesn't fire on polling failures (only on a handful of webhook/event paths), so polling failures bypass the rebuild entirely. Meanwhile, the polling-side recovery (stopPolling+restartPolling on the same _polling instance) keeps the HTTP keep-alive agent pool intact, so zombie sockets accumulate after a network drop and the lib's polling state machine gets thrashed by overlapping stop/start cycles.
(1) Polling-burst circuit breaker
Modelled on the existing 409-Conflict breaker. Five polling errors within 60 s trip an escalation from the cheap restartPolling to a full scheduleRestart, which abortBots the bot, destroyRequestPools the agent pool, and constructs a fresh bot via createTelegramBot. The pool gets genuinely rebuilt on the polling code path for the first time. Transient blips stay on the cheap path (petermeter69's first event on 27 May was 3 errors in 28 s, below the threshold); only sustained bursts escalate.
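For reviewers, the trip condition can be sketched as a sliding-window counter. This is an illustration only, not the code in this PR: createBurstBreaker and record are hypothetical names, and resetting the window after a trip is an assumption about the desired behaviour. Only the 5-errors-in-60-s threshold comes from the description above.

```javascript
// Hypothetical sketch of a polling-burst breaker (names are illustrative).
// Threshold and window default to the values described above: 5 errors in 60 s.
function createBurstBreaker({ threshold = 5, windowMs = 60000 } = {}) {
  let stamps = []; // timestamps (ms) of recent polling errors

  return {
    // Record one polling error. Returns true when the burst threshold is
    // reached, i.e. when the caller should escalate to a full restart.
    record(now = Date.now()) {
      // Drop errors that have aged out of the window.
      stamps = stamps.filter((t) => now - t < windowMs);
      stamps.push(now);
      if (stamps.length >= threshold) {
        stamps = []; // assumption: start counting afresh after a trip
        return true;
      }
      return false;
    },
  };
}
```

In a polling_error handler, a true return would select the scheduleRestart path and a false return the cheap restartPolling path. With these defaults, 3 errors in 28 s never trips, matching the 27 May event.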
(2) Remove the give-up cap on scheduleRestart
Prerequisite for (1) to retry indefinitely under a sustained outage, and a fix in its own right. V17.4.12 and earlier capped at 8 attempts (~4.5 min), then logged "Bot ... gave up restarting after fatal: ..." and returned permanently. That left the operator with the same recourse as not retrying at all (manual redeploy) while removing every chance of automatic recovery when the network eventually came back. The helper now keeps retrying at the 60 s ceiling forever. A single node.error fires the first time the ceiling is reached (~90 s into a sustained outage, after the exponential ramp 3+6+12+24+48 ≈ 90 s is consumed), so the operator still gets one actionable alert without being spammed every minute.
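The timing figures above can be checked with a small sketch of the delay schedule. It assumes a 3 s base that doubles per attempt and is clamped at 60 s, which is inferred from the 3+6+12+24+48 ramp rather than taken from the code; restartDelayMs and isFirstCeilingAttempt are hypothetical names.

```javascript
// Hypothetical sketch of the restart backoff (names are illustrative).
// Assumes: 3 s base delay, doubling per attempt, clamped at a 60 s ceiling.
function restartDelayMs(attempt, baseMs = 3000, ceilingMs = 60000) {
  // attempt is 1-based: 3 s, 6 s, 12 s, 24 s, 48 s, then 60 s forever.
  return Math.min(baseMs * 2 ** (attempt - 1), ceilingMs);
}

// True only for the first attempt whose delay hits the ceiling, which is
// where a one-time alert would fire under the scheme described above.
function isFirstCeilingAttempt(attempt, baseMs = 3000, ceilingMs = 60000) {
  const atCeiling = restartDelayMs(attempt, baseMs, ceilingMs) === ceilingMs;
  const prevBelow =
    attempt === 1 || restartDelayMs(attempt - 1, baseMs, ceilingMs) < ceilingMs;
  return atCeiling && prevBelow;
}

// Total time consumed by the first n delays, in milliseconds.
function elapsedAfter(n) {
  let total = 0;
  for (let attempt = 1; attempt <= n; attempt++) {
    total += restartDelayMs(attempt);
  }
  return total;
}
```

Under these assumptions the first five delays sum to 93 s (the "~90 s" ramp), the sixth attempt is the first at the ceiling, and eight delays sum to 273 s, which matches the old ~4.5 min cap.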
Tests
8 new mocha cases (3 for the give-up removal, 5 for the polling-burst breaker). 239 passing.
Closes #442 (pending production confirmation).