V17.4.5 — auto-restart now actually rebuilds the HTTP keep-alive socket pool
Direct fix for the root cause of @petermeter69's observation on #442:
"Given the solid remote desktop connection, and the fact that the TG connection does not auto resume, it is obvious to me that it is the internal state of the node and not the connection quality that causes the failure."
He was right. V17.4.0's scheduleRestart claimed in its own code comment that "a successful create rebuilds the http.Agent so a stale keep-alive pool is replaced" — but that wasn't actually true.
What was happening
`bot-node.js` built the `this.request` options object once at config-node construction and reused the same object reference across every bot rebuild. For the non-SOCKS path it set `agentOptions: { keepAlive: true }` but did not set a `pool` field — and @cypress/request (which node-telegram-bot-api uses under the hood) treats a missing `pool` as "use my process-global pool". Looking at `@cypress/request/request.js:689`:
```js
if (!self.pool[poolKey]) {
  self.pool[poolKey] = new Agent(options);
}
return self.pool[poolKey];
```

The agent is cached on the pool under a key derived from protocol + cert options. Same protocol + same options + same global pool = same agent instance before and after the rebuild. So when the network silently dropped a keep-alive socket (the half-open `connect ETIMEDOUT` shape), the same `https.Agent` kept handing that socket out, `scheduleRestart` rebuilt the bot on top of the same wedge, and the only way to clear it was to restart the whole Node-RED process so the global pool was rebuilt from scratch. Exactly the "manual redeploy is the only way" behaviour reported across #442's full history.
What changes in V17.4.5
Two small changes in bot-node.js:
1. Per-bot pool ownership. A new `buildRequestOptions()` method returns a request options object with a fresh `pool: {}` reference every call, and stores that reference on `self.requestPool`. The config-node constructor now calls this once instead of inlining the request shape. Each bot instance now owns its own agent pool instead of sharing the process-global one.
2. Explicit pool destruction on restart. A new `destroyRequestPool()` iterates `self.requestPool` and calls `.destroy()` on every cached agent — actively closing all idle keep-alive sockets and preventing reuse of in-flight ones. `scheduleRestart`'s success path now calls `destroyRequestPool()` before re-creating the bot, then `buildRequestOptions()` to install a fresh empty pool. The close handler does the same so a redeploy doesn't leak sockets either.
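Sketched together, the two changes look roughly like this. Only the method and field names (`buildRequestOptions`, `destroyRequestPool`, `requestPool`) come from the release notes; the surrounding class shape is illustrative, and the pool entries are assumed to be ordinary `http(s).Agent` instances exposing `.destroy()`:

```javascript
// Illustrative sketch of the two V17.4.5 changes; not the actual
// bot-node.js source, which wires these into the Node-RED config node.
class TelegramBotConfigNode {
  // Change 1: a fresh `pool: {}` per call, owned by this bot instance,
  // so @cypress/request never falls back to its process-global pool.
  buildRequestOptions() {
    this.requestPool = {};
    return {
      agentOptions: { keepAlive: true },
      pool: this.requestPool,
    };
  }

  // Change 2: destroy every cached agent, closing idle keep-alive
  // sockets and preventing reuse of in-flight ones.
  destroyRequestPool() {
    if (!this.requestPool) { return; } // defensive no-op
    for (const key of Object.keys(this.requestPool)) {
      const agent = this.requestPool[key];
      if (agent && typeof agent.destroy === 'function') {
        agent.destroy();
      }
    }
    this.requestPool = null;
  }
}
```

The ordering on the restart path is the point: tear down the wedged sockets first, then install a fresh empty pool, then create the new bot on top of it.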
Result: after scheduleRestart fires, the new bot starts with a genuinely empty socket pool. If the underlying network has recovered, the next request opens a brand-new TCP connection. If it hasn't, the next request fails honestly with a fresh error rather than silently reusing a zombie socket.
What V17.4.5 does NOT fix
The actual connect ETIMEDOUT / EAI_AGAIN / 502 Bad Gateway failures themselves are network-layer issues outside the plugin's reach. What changes is: the bot now genuinely recovers once the underlying network does, instead of staying wedged on a stale pool until manual redeploy. That's the gap petermeter69 was pointing at.
About the 17.3.0 → 17.4.4 chain
The fixes since 17.3.0 layered up:
- 17.3.0 removed the explicit polling-internals teardown in favour of the documented `startPolling({ restart: true })` soft restart. Cleaner, but turned out to keep enough internal state to race a 409 Conflict on redeploy.
- 17.4.0 added auto-restart on fatal errors so the bot stops dying silently. Closed the "stays broken forever" path but introduced the rapid-restart-at-3 s problem.
- 17.4.1 suppressed duplicate "Bot error: ..." warns during outage bursts. Log volume, not behaviour.
- 17.4.2 added the 60-second stable window so the backoff curve actually escalates (3 s → 6 s → 12 s → 24 s → 48 s → 60 s → surrender). This was a real bug that #442's V17.4.0 retest surfaced.
- 17.4.3 rewrote the `Bot error: ...` log lines via `formatErrorChain` so the leaf-level message (`connect ETIMEDOUT 149.154.166.110:443`) appears in the headline instead of the unhelpful `AggregateError` wrapper.
- 17.4.4 fixed the 409 Conflict loop on redeploy by detecting it in `polling_error` (skip restart, let the library's own retry handle it) and explicitly nulling `_polling` before `startPolling({restart:true})`.
- 17.4.5 (this release) addresses what petermeter69 was actually pointing at all along: the agent pool wasn't really being rebuilt.
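The 17.4.2 backoff curve above can be sketched as a tiny helper. The constants (3 s start, 60 s cap, surrender after the 60 s attempt) are read off these release notes, not taken from the plugin source, and the 60-second stable-window reset is omitted:

```javascript
// Illustrative sketch of the escalating restart backoff described for
// 17.4.2. Constants come from the release notes; the real logic lives
// in scheduleRestart inside bot-node.js.
const INITIAL_DELAY_MS = 3_000;
const MAX_DELAY_MS = 60_000;

// Returns the delay for the next restart attempt, or null to surrender.
function nextDelay(currentDelayMs) {
  if (currentDelayMs >= MAX_DELAY_MS) {
    return null; // the 60 s attempt was the last one
  }
  return Math.min(currentDelayMs * 2, MAX_DELAY_MS);
}

// Walking the curve: 3 s → 6 s → 12 s → 24 s → 48 s → 60 s → surrender.
const curve = [INITIAL_DELAY_MS];
let delay = INITIAL_DELAY_MS;
while ((delay = nextDelay(delay)) !== null) {
  curve.push(delay);
}
console.log(curve); // [ 3000, 6000, 12000, 24000, 48000, 60000 ]
```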
Tests
222 passing (up from 215 in V17.4.4). 7 new mocha cases in `test/nodes/bot-node-restart.test.js`:
- constructor wires up a per-bot `requestPool` object
- `buildRequestOptions` returns a fresh pool object on each call
- `destroyRequestPool` calls `.destroy()` on every agent and nulls the pool
- `destroyRequestPool` with no pool set is a no-op (defensive)
- non-SOCKS bot uses keepAlive `agentOptions` and a `pool` field
- close handler destroys the request pool
The end-to-end "wedged-pool recovers after scheduleRestart" path is exercised indirectly — we don't yet have a way to fake a half-dead keep-alive socket in the mock-Telegram integration harness. petermeter69's setup will be the real test.
What to look for after upgrading
If the bot still goes silent after a network blip on V17.4.5, that's a genuinely new failure shape and worth opening a fresh issue for. The agent-pool hypothesis is closed.