V17.4.2 — let the backoff escalate for sustained network problems
Patch release on top of V17.4.1 addressing the V17.4.0 retest of #442 by @petermeter69.
The problem
V17.4.0's auto-restart was firing correctly (the SLIGHTLYBETTEREFATAL traces in the issue are our patched FatalError flowing through scheduleRestart), but for sustained network outages it was oscillating at the minimum 3-second cadence instead of escalating through the exponential backoff curve I'd designed. From the user's perspective: the bot kept failing every few seconds, and recovery felt unstable.
The cause
The success path of scheduleRestart was zeroing restartCount the instant getTelegramBot() returned a non-null bot:

```js
if (bot) {
    self.restartCount = 0; // <-- immediate reset
    self.status = 'connected';
    // ...
}
```

But the rebuilt bot hadn't been verified to actually work yet. For a persistent connectivity problem with errors arriving every ~5 s:
- T+0: error 1 → schedule restart, count=1, delay 3 s
- T+3: restart fires → create succeeds → count=0 ← reset too eagerly
- T+5: error 2 → schedule restart, count=1, delay 3 s ← back to minimum
- T+8: restart fires → create succeeds → count=0
- T+10: error 3 → ...
The exponential curve (3 s → 6 s → 12 s → ...) never got the chance to do its job.
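For reference, the intended curve is just a doubling delay with a 3-second floor and a 60-second ceiling. A minimal sketch of that shape (illustrative only; the constants match this note, but the node computes its delay internally and may phrase it differently):

```js
// Illustrative sketch of the backoff curve described above: the delay starts
// at 3 s, doubles per attempt, and is capped at 60 s.
function restartDelayMs(restartCount) {
  const MIN_DELAY_MS = 3000;   // first retry after 3 s
  const MAX_DELAY_MS = 60000;  // never wait longer than 60 s
  return Math.min(MIN_DELAY_MS * 2 ** (restartCount - 1), MAX_DELAY_MS);
}

// restartCount 1..8 → 3 s, 6 s, 12 s, 24 s, 48 s, 60 s, 60 s, 60 s
```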
The fix
scheduleRestart's success path now sets a 60-second restartStableTimer rather than resetting restartCount immediately. Three outcomes (a sketch of the adjusted path follows the list):
- Stable window completes (60 s with no fresh errors): `restartCount` resets to 0. Next blip starts the curve over.
- New error before the timer fires: `scheduleRestart` clears the stable timer and treats the new error as a continuation. `restartCount` keeps climbing through 6 s → 12 s → 24 s → 48 s → 60 s (capped).
- Node closed: the timer is cleared along with the other pending timers in the close handler.
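Put together, a minimal sketch of the adjusted path (the names scheduleRestart, restartCount, restartStableTimer and getTelegramBot come from this note; the constants and the rest of the structure are simplified assumptions, and the real node code carries more state and logging):

```js
const MIN_DELAY_MS = 3000;      // first retry after 3 s
const MAX_DELAY_MS = 60000;     // backoff cap
const STABLE_WINDOW_MS = 60000; // how long a restart must survive before the counter resets
const MAX_ATTEMPTS = 8;         // surrender ceiling

function scheduleRestart(self, getTelegramBot) {
  // A fresh error voids any pending stable window: treat it as a continuation.
  if (self.restartStableTimer) {
    clearTimeout(self.restartStableTimer);
    self.restartStableTimer = null;
  }

  self.restartCount += 1;
  if (self.restartCount > MAX_ATTEMPTS) {
    self.status = 'gave up';    // ceiling unchanged in this release
    return;
  }

  const delay = Math.min(MIN_DELAY_MS * 2 ** (self.restartCount - 1), MAX_DELAY_MS);
  setTimeout(() => {
    const bot = getTelegramBot();
    if (bot) {
      self.status = 'connected';
      // No immediate reset any more: only after 60 s without a fresh error
      // does the counter go back to 0.
      self.restartStableTimer = setTimeout(() => {
        self.restartCount = 0;
        self.restartStableTimer = null;
      }, STABLE_WINDOW_MS);
    }
  }, delay);
}

// The close handler clears restartStableTimer along with the node's other
// pending timers (not shown here).
```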
Net effect for sustained outages like petermeter69's: the bot now spaces attempts apart instead of hammering at minimum cadence — gives the network time to actually recover between tries.
Net effect for transient one-off blips: unchanged. Quick recovery, stable window completes, counter resets, ready for the next blip.
The 8-attempts-then-surrender ceiling is unchanged: 3 + 6 + 12 + 24 + 48 + 60 + 60 + 60 = 273 s of delays, so roughly 4.6 minutes of trying in the worst case before logging "gave up".
Test coverage
Three new mocha cases in test/nodes/bot-node-restart.test.js:
- Error inside the stable window increments count instead of resetting.
- Error after the stable window has fired starts from count=0.
- Close handler clears the pending stable timer.
208 tests pass.
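The actual cases live in the project's suite; as a hedged illustration of the first one, here is how it could look in plain mocha with sinon fake timers, driving the simplified scheduleRestart sketch from "The fix" above rather than the real node:

```js
const assert = require('assert');
const sinon = require('sinon');

describe('restart backoff (simplified model)', function () {
  let clock;
  beforeEach(() => { clock = sinon.useFakeTimers(); });
  afterEach(() => { clock.restore(); });

  it('increments the count when an error lands inside the stable window', () => {
    const self = { restartCount: 0, restartStableTimer: null, status: 'disconnected' };
    const getTelegramBot = () => ({});      // pretend the rebuild always succeeds

    scheduleRestart(self, getTelegramBot);  // error 1: count=1, restart in 3 s
    clock.tick(3000);                       // restart fires, 60 s stable window opens
    assert.strictEqual(self.restartCount, 1);

    clock.tick(5000);                       // only 5 s pass; the window is still open
    scheduleRestart(self, getTelegramBot);  // error 2 lands inside the window
    assert.strictEqual(self.restartCount, 2); // counter climbed instead of resetting
  });
});
```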
What to look at if errors persist
If you keep seeing SLIGHTLYBETTEREFATAL: AggregateError [ETIMEDOUT] on V17.4.2, the underlying TCP connection to Telegram's API is failing — that's below the plugin layer. From the comment on #411:
- Try setting `Address family` on the bot config to `4` — the `AggregateError` shape is a dual-stack fingerprint, and forcing IPv4 eliminates it when IPv6 is the broken half (a quick check of what the host resolves is sketched after this list).
- `dig +short api.telegram.org` to confirm DNS is working.
- `mtr -T -P 443 149.154.166.110` from the affected host during an incident.
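If you want to see what the affected host actually resolves for api.telegram.org, and whether it gets both IPv6 and IPv4 answers (the dual-stack situation the first point describes), a small standalone Node script is enough. This is purely a diagnostic sketch, not part of the plugin:

```js
// Diagnostic sketch: list every address the host resolves for the API
// hostname. Both IPv6 and IPv4 entries showing up means you are in the
// dual-stack case described above, where forcing IPv4 can help.
const dns = require('dns');

dns.lookup('api.telegram.org', { all: true }, (err, addresses) => {
  if (err) {
    console.error('DNS lookup failed:', err.code);
    return;
  }
  for (const { address, family } of addresses) {
    console.log(`IPv${family}: ${address}`);
  }
});
```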
Also: the `gave up restarting after fatal` log line is now actually reachable for the first time (it was effectively unreachable on V17.4.0 / V17.4.1 because the counter kept resetting). If you see that line in your log, the bot has surrendered after 8 backoff attempts — you'll need to investigate the underlying connectivity.