github RunOnFlux/flux v8.9.2

11 hours ago

Background

An encrypted enterprise syncthing app (openclawpro1774571881282) was installed on a node (92.170.17.66:16187). For syncthing apps with g: containerData, the installer correctly creates the container but does not start it — the container sits stopped while the syncthing state machine syncs data from the primary, then masterSlaveApps() starts it when ready.

During another app's installation, pruneContainers() ran and permanently deleted the stopped container. The node then spent 13+ hours broadcasting the app as "running" to the network with no Docker container, while masterSlaveApps() silently failed to start it every 30 seconds.

Root cause

The pruneContainers() guard in registerAppLocally (appInstaller.js:460-480) builds a list of installed app component names by iterating compose arrays to detect stopped containers. But installedApps() returns raw DB records — for encrypted enterprise apps, compose is [] because the specs are encrypted. So encrypted enterprise apps produce zero component names, the guard thinks there are no stopped apps, and pruneContainers() deletes the stopped container.

peerNotification.js:143 already calls decryptEnterpriseApps() before iterating compose arrays in the same way. appInstaller.js did not.
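
The failure mode can be reproduced in a few lines. This is a minimal sketch, not the code in appInstaller.js: the record shape, the `component_app` container naming, and the helpers componentNames() and pruneIsSafe() are assumptions made for illustration, and decryptedRecord stands in for whatever decryptEnterpriseApps() actually returns.

```javascript
// Raw DB record for an encrypted enterprise app: compose is empty
// because the specs are still encrypted (assumed record shape).
const rawRecord = { name: 'exampleapp', enterprise: '<encrypted blob>', compose: [] };

// Stand-in for the output of decryptEnterpriseApps(): compose is populated.
const decryptedRecord = { ...rawRecord, compose: [{ name: 'web' }] };

// Hypothetical helper: build container names by iterating compose arrays.
// The `${component}_${app}` naming is an assumption for this sketch.
function componentNames(apps) {
  return apps.flatMap((app) => app.compose.map((component) => `${component.name}_${app.name}`));
}

// Hypothetical guard: pruning is considered safe only when no stopped
// container belongs to an installed app component.
function pruneIsSafe(apps, stoppedContainers) {
  const names = componentNames(apps);
  return !stoppedContainers.some((container) => names.includes(container));
}

const stopped = ['web_exampleapp'];
console.log(pruneIsSafe([rawRecord], stopped)); // true: the guard sees no components
console.log(pruneIsSafe([decryptedRecord], stopped)); // false: the stopped container is protected
```

With the raw record the guard produces zero component names and allows the prune. With the decrypted record the stopped container is recognised and the prune is skipped.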

Secondary issue

Master/slave (g:) apps were completely excluded from the stopped-app recovery loop in peerNotification.js (line 188: if (appDetails && !appInstalledMasterSlaveCheck)). When a master/slave app's container goes missing, the app never reaches the !containerExists check that would trigger recreateMissingContainers or removeAppLocally. The node just keeps trying to start a non-existent container and keeps broadcasting the app as running.
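
A rough sketch of the effect of that condition follows. The isMasterSlave() check (a component whose containerData contains 'g:') and the function appsCheckedForMissingContainer() are assumptions for illustration, not the actual peerNotification.js code.

```javascript
// Assumed detection: an app is master/slave when any component's
// containerData string carries the 'g:' flag.
function isMasterSlave(app) {
  return app.compose.some(
    (component) => typeof component.containerData === 'string' && component.containerData.includes('g:'),
  );
}

// Hypothetical model of the recovery loop: returns the names of apps that
// would reach the missing-container check. With includeMasterSlave = false
// (the old behaviour) master/slave apps are filtered out entirely.
function appsCheckedForMissingContainer(apps, includeMasterSlave) {
  return apps
    .filter((app) => includeMasterSlave || !isMasterSlave(app))
    .map((app) => app.name);
}

const installed = [
  { name: 'plainapp', compose: [{ name: 'web', containerData: '/data' }] },
  { name: 'syncapp', compose: [{ name: 'web', containerData: 'g:/data' }] },
];

console.log(appsCheckedForMissingContainer(installed, false)); // [ 'plainapp' ]
console.log(appsCheckedForMissingContainer(installed, true)); // [ 'plainapp', 'syncapp' ]
```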

Changes

appInstaller.js — Call decryptEnterpriseApps() on the installed apps list before iterating compose arrays for the prune guard.

peerNotification.js — Add handleMissingMasterSlaveContainer() for master/slave apps with missing containers. A stopped container is normal (syncthing secondary) and left alone. A missing container triggers recreation via the existing recreateMissingContainers(), with:

  • Backup/restore awareness (skips if app is being backed up)
  • TOCTOU protection (if recreation fails but another process created the container, skip removal)
  • Fallback to removeAppLocally if recreation fails and container is still missing
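
The decision flow described above might look roughly like the sketch below. Only the function names handleMissingMasterSlaveContainer, recreateMissingContainers, and removeAppLocally come from the release notes. The `deps` object, its containerExists() and isBackupOrRestoreRunning() members, the signatures, and the returned status strings are hypothetical and exist only to make the sketch self-contained.

```javascript
// Sketch of the decision flow, with every dependency injected.
// `deps` and the returned status strings are assumptions, not the real API.
async function handleMissingMasterSlaveContainer(appName, deps) {
  // A container that exists, even if stopped, is normal for a syncthing secondary.
  if (await deps.containerExists(appName)) return 'present';

  // Backup/restore awareness: do not touch an app that is being backed up.
  if (deps.isBackupOrRestoreRunning(appName)) return 'skipped-backup';

  try {
    await deps.recreateMissingContainers(appName);
    return 'recreated';
  } catch (error) {
    // TOCTOU protection: another process may have created the container
    // while recreation was failing, so check again before removing.
    if (await deps.containerExists(appName)) return 'created-elsewhere';

    // Fallback: recreation failed and the container is still missing.
    await deps.removeAppLocally(appName);
    return 'removed';
  }
}
```

Re-checking container existence inside the catch block is what keeps a failed recreation from removing an app that another process has just restored.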

Test plan

  • decryptEnterpriseApps called before prune guard component name iteration
  • Encrypted enterprise app with stopped container prevents pruneContainers from running
  • handleMissingMasterSlaveContainer returns early when container exists
  • Missing container triggers recreation, with fallback to app removal
  • TOCTOU: if another process creates container during recreation failure, skip removal
  • 21 unit tests passing across both files
