V17.4.5 — auto-restart now actually rebuilds the HTTP keep-alive socket pool
Direct fix for the root cause of @petermeter69's observation on #442:
"Given the solid remote desktop connection, and the fact that the TG connection does not auto resume, it is obvious to me that it is the internal state of the node and not the connection quality that causes the failure."
He was right. V17.4.0's scheduleRestart claimed in its own code comment that "a successful create rebuilds the http.Agent so a stale keep-alive pool is replaced" — but that wasn't actually true.
What was happening
`bot-node.js` built the `this.request` options object once at config-node construction and reused the same object reference across every bot rebuild. For the non-SOCKS path it set `agentOptions: { keepAlive: true }` but did not set a `pool` field — and @cypress/request (which node-telegram-bot-api uses under the hood) treats a missing `pool` as "use my process-global pool". Looking at `@cypress/request/request.js:689`:
```js
if (!self.pool[poolKey]) {
  self.pool[poolKey] = new Agent(options);
}
return self.pool[poolKey];
```

The agent is cached on the pool under a key derived from protocol + cert options. Same protocol + same options + same global pool = same agent instance before and after the rebuild. So when the network silently dropped a keep-alive socket (the half-open `connect ETIMEDOUT` shape), the same `https.Agent` kept handing that socket out, `scheduleRestart` rebuilt the bot on top of the same wedge, and the only way to clear it was to restart the whole Node-RED process so the global pool was rebuilt from scratch. Exactly the "manual redeploy is the only way" behaviour reported across #442's full history.
What changes in V17.4.5
Two small changes in bot-node.js:
1. Per-bot pool ownership. A new `buildRequestOptions()` method returns a request options object with a fresh `pool: {}` reference every call, and stores that reference on `self.requestPool`. The config-node constructor now calls this once instead of inlining the request shape. Each bot instance now owns its own agent pool instead of sharing the process-global one.
2. Explicit pool destruction on restart. A new `destroyRequestPool()` iterates `self.requestPool` and calls `.destroy()` on every cached agent — actively closing all idle keep-alive sockets and preventing reuse of in-flight ones. `scheduleRestart`'s success path now calls `destroyRequestPool()` before re-creating the bot, then `buildRequestOptions()` to install a fresh empty pool. The close handler does the same so a redeploy doesn't leak sockets either.
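Sketched together, the two changes look roughly like this. Only the method and field names (`buildRequestOptions`, `destroyRequestPool`, `requestPool`) come from the release notes; the surrounding class shape is illustrative, and the pool entries are assumed to be ordinary `http(s).Agent` instances exposing `.destroy()`:

```javascript
// Illustrative sketch of the two V17.4.5 changes; not the actual
// bot-node.js source, which wires these into the Node-RED config node.
class TelegramBotConfigNode {
  // Change 1: a fresh `pool: {}` per call, owned by this bot instance,
  // so @cypress/request never falls back to its process-global pool.
  buildRequestOptions() {
    this.requestPool = {};
    return {
      agentOptions: { keepAlive: true },
      pool: this.requestPool,
    };
  }

  // Change 2: destroy every cached agent, closing idle keep-alive
  // sockets and preventing reuse of in-flight ones.
  destroyRequestPool() {
    if (!this.requestPool) { return; } // defensive no-op
    for (const key of Object.keys(this.requestPool)) {
      const agent = this.requestPool[key];
      if (agent && typeof agent.destroy === 'function') {
        agent.destroy();
      }
    }
    this.requestPool = null;
  }
}
```

The ordering on the restart path is the point: tear down the wedged sockets first, then install a fresh empty pool, then create the new bot on top of it.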
Result: after scheduleRestart fires, the new bot starts with a genuinely empty socket pool. If the underlying network has recovered, the next request opens a brand-new TCP connection. If it hasn't, the next request fails honestly with a fresh error rather than silently reusing a zombie socket.
What V17.4.5 does NOT fix
The actual connect ETIMEDOUT / EAI_AGAIN / 502 Bad Gateway failures themselves are network-layer issues outside the plugin's reach. What changes is: the bot now genuinely recovers once the underlying network does, instead of staying wedged on a stale pool until manual redeploy. That's the gap petermeter69 was pointing at.
About the 17.3.0 → 17.4.4 chain
The fixes since 17.3.0 layered up:
- 17.3.0 removed the explicit polling-internals teardown in favour of the documented `startPolling({ restart: true })` soft restart. Cleaner, but turned out to keep enough internal state to race a 409 Conflict on redeploy.
- 17.4.0 added auto-restart on fatal errors so the bot stops dying silently. Closed the "stays broken forever" path but introduced the rapid-restart-at-3 s problem.
- 17.4.1 suppressed duplicate "Bot error: ..." warns during outage bursts. Log volume, not behaviour.
- 17.4.2 added the 60-second stable window so the backoff curve actually escalates (3 s → 6 s → 12 s → 24 s → 48 s → 60 s → surrender). This was a real bug that #442's V17.4.0 retest surfaced.
- 17.4.3 rewrote the `Bot error: ...` log lines via `formatErrorChain` so the leaf-level message (`connect ETIMEDOUT 149.154.166.110:443`) appears in the headline instead of the unhelpful `AggregateError` wrapper.
- 17.4.4 fixed the 409 Conflict loop on redeploy by detecting it in `polling_error` (skip restart, let the library's own retry handle it) and explicitly nulling `_polling` before `startPolling({restart:true})`.
- 17.4.5 (this release) addresses what petermeter69 was actually pointing at all along: the agent pool wasn't really being rebuilt.
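The 17.4.2 backoff curve above can be sketched as a tiny helper. The constants (3 s start, 60 s cap, surrender after the 60 s attempt) are read off these release notes, not taken from the plugin source, and the 60-second stable-window reset is omitted:

```javascript
// Illustrative sketch of the escalating restart backoff described for
// 17.4.2. Constants come from the release notes; the real logic lives
// in scheduleRestart inside bot-node.js.
const INITIAL_DELAY_MS = 3_000;
const MAX_DELAY_MS = 60_000;

// Returns the delay for the next restart attempt, or null to surrender.
function nextDelay(currentDelayMs) {
  if (currentDelayMs >= MAX_DELAY_MS) {
    return null; // the 60 s attempt was the last one
  }
  return Math.min(currentDelayMs * 2, MAX_DELAY_MS);
}

// Walking the curve: 3 s → 6 s → 12 s → 24 s → 48 s → 60 s → surrender.
const curve = [INITIAL_DELAY_MS];
let delay = INITIAL_DELAY_MS;
while ((delay = nextDelay(delay)) !== null) {
  curve.push(delay);
}
console.log(curve); // [ 3000, 6000, 12000, 24000, 48000, 60000 ]
```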
Tests
222 passing (up from 215 in V17.4.4). 7 new mocha cases in `test/nodes/bot-node-restart.test.js`:
- constructor wires up a per-bot `requestPool` object
- `buildRequestOptions` returns a fresh pool object on each call
- `destroyRequestPool` calls `.destroy()` on every agent and nulls the pool
- `destroyRequestPool` with no pool set is a no-op (defensive)
- non-SOCKS bot uses keepAlive `agentOptions` and a `pool` field
- close handler destroys the request pool
The end-to-end "wedged-pool recovers after scheduleRestart" path is exercised indirectly — we don't yet have a way to fake a half-dead keep-alive socket in the mock-Telegram integration harness. petermeter69's setup will be the real test.
What to look for after upgrading
If the bot still goes silent after a network blip on V17.4.5, that's a genuinely new failure shape and worth opening a fresh issue for. The agent-pool hypothesis is closed.