RunOnFlux/flux v7.2.0

Summary

This PR introduces comprehensive improvements to the Syncthing global sync (g: flag / master-slave) functionality: safer mounts after reboot, new health monitoring, and stricter primary node election. It also fixes a race condition where loop device mounts failed on system boot because the encrypted volume was not yet available.


Problems Identified and Fixed

  1. Critical: Data Loss Risk on Reboot (Race Condition)

Problem: After system reboot, @reboot cron jobs for mounting loop devices executed before the encrypted LUKS volume (/dat) was mounted, causing silent mount failures. This left app folders empty,
which could trigger Syncthing to propagate empty data to peers.

Evidence from logs:

```
Nov 17 18:03:58 CRON[1572]: (CRON) info (No MTA installed, discarding output)
```

Mounts failed silently because /dat wasn't ready when cron ran.

Solution:

  • Added wait logic to crontab mount commands: `while [ ! -f ]; do sleep 5; done && sudo mount ...` (see the sketch after this list)
  • Created crontabAndMountsCleanup service that runs on startup to:
    • Update old crontab entries to include wait logic
    • Remove stale entries for uninstalled apps
    • Verify and execute missing mounts immediately
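A minimal JavaScript sketch of how such a crontab entry could be assembled. The function name and the marker/mount paths are illustrative assumptions, not the PR's actual code:

```js
// Hypothetical sketch, not the PR's code: build a @reboot crontab entry that
// waits for the encrypted volume before mounting a loop device.
function buildRebootMountEntry(loopFile, mountPoint, markerFile) {
  // Poll until a marker file on the encrypted volume appears, then mount.
  const wait = `while [ ! -f ${markerFile} ]; do sleep 5; done`;
  const mount = `sudo mount -o loop ${loopFile} ${mountPoint}`;
  return `@reboot ${wait} && ${mount}`;
}

// Example (all paths illustrative):
console.log(buildRebootMountEntry(
  '/dat/apps/myapp.img',
  '/dat/apps/myapp',
  '/dat/.flux-volume-ready',
));
```

The key property is that the mount command cannot run until the encrypted volume is demonstrably present, so a reboot can no longer leave an app folder silently empty.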
  2. Syncthing Processing While Folders Are Unmounted

Problem: Syncthing monitor would process apps even when their loop device volumes weren't mounted, potentially causing data inconsistencies.

Solution:

  • Added mount safety verification in syncthingAppsCore() that skips all processing if any app folder exists but isn't properly mounted
  • Health monitor now skips folders that aren't mounted yet
  • Uses verifyFolderMountSafety() to detect empty unmounted directories (likely unmounted loop devices); a sketch follows this list
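A minimal Node.js sketch of what such a check might look like. Only the idea (empty folder that is not a mountpoint means unmounted loop device) comes from the release notes; the function name below and the `mountpoint -q` probe are assumptions:

```js
const fs = require('node:fs/promises');
const { execFile } = require('node:child_process');
const { promisify } = require('node:util');

const execFileAsync = promisify(execFile);

// Assumed logic: a folder is unsafe to sync when it exists, is empty, and is
// not an actual mountpoint (the signature of an unmounted loop device).
async function isFolderMountSafe(folderPath) {
  let entries;
  try {
    entries = await fs.readdir(folderPath);
  } catch {
    return false; // folder missing entirely; skip processing
  }
  if (entries.length > 0) return true; // has content, treat as mounted

  try {
    // `mountpoint -q` exits 0 only when the path is a real mountpoint.
    await execFileAsync('mountpoint', ['-q', folderPath]);
    return true; // empty but genuinely mounted (e.g. a fresh volume)
  } catch {
    return false; // empty and unmounted: do not let Syncthing touch it
  }
}
```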
  3. Master-Slave Apps Not Respecting the Primary Node

Problem: Global sync (g: flag) apps were being started/restarted on non-primary nodes, causing potential conflicts.

Solution: Added shouldStartGlobalSyncApp() function that checks if the current node is the primary before starting/restarting master-slave apps.
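A hedged sketch of that gate; only the function name comes from the PR, and how the primary is elected and how the g: flag is detected are assumptions here:

```js
// Sketch only: election inputs and flag detection are illustrative.
function shouldStartGlobalSyncApp(appSpec, myNodeId, primaryNodeId) {
  const isMasterSlave = typeof appSpec.containerData === 'string'
    && appSpec.containerData.includes('g:');
  if (!isMasterSlave) return true; // ordinary apps are unaffected
  // Master-slave apps only start (or restart) on the elected primary.
  return myNodeId === primaryNodeId;
}
```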

  4. Syncthing Apps Installing on Unreachable Nodes

Problem: Apps could be spawned on nodes with the same subdomain or nodes that weren't reachable, causing sync failures.

Solution: Added validation in appSpawner.js to check node reachability and avoid same-subdomain installations.
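One plausible shape for these checks, assuming a candidate record with an API URL and subdomain, a set of subdomains already in use, and Node 18+ global fetch. None of this is the PR's exact logic:

```js
// Illustrative only: the PR's actual reachability probe and subdomain rule
// are not shown in the release notes.
async function isNodeViableForSpawn(candidate, usedSubdomains) {
  // Never place a second instance behind a subdomain already in use.
  if (usedSubdomains.has(candidate.subdomain)) return false;

  try {
    // Basic liveness probe against the node's API with a 5 s timeout.
    const res = await fetch(candidate.apiUrl, {
      signal: AbortSignal.timeout(5000),
    });
    return res.ok;
  } catch {
    return false; // unreachable: syncing to this node would fail anyway
  }
}
```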


New Features

  1. Syncthing Health Monitor (syncthingHealthMonitor.js)
  • Monitors cluster health and connectivity
  • Detects isolated nodes (no peers connected)
  • Takes corrective actions for prolonged sync issues (escalation sketched after this list):
    • Warning after configurable threshold
    • Stop container after extended issues
    • Restart Syncthing service
    • Remove app locally if unrecoverable
  • Automatically restarts stopped apps when issues resolve
  • Comprehensive diagnostics and logging
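A hedged sketch of the escalation ladder; the threshold values and action names below are illustrative, while the real constants live in syncthingMonitorConstants.js:

```js
// Illustrative thresholds, not the PR's constants.
const THRESHOLDS_MS = {
  warn: 10 * 60 * 1000,             // log a warning after 10 min of issues
  stopContainer: 30 * 60 * 1000,    // stop the app container after 30 min
  restartSyncthing: 60 * 60 * 1000, // restart the Syncthing service after 1 h
  removeApp: 3 * 60 * 60 * 1000,    // last resort: remove the app locally
};

// Pick the most severe action whose threshold the ongoing issue has crossed.
function nextAction(issueStartedAtMs, nowMs = Date.now()) {
  const elapsed = nowMs - issueStartedAtMs;
  if (elapsed >= THRESHOLDS_MS.removeApp) return 'removeAppLocally';
  if (elapsed >= THRESHOLDS_MS.restartSyncthing) return 'restartSyncthing';
  if (elapsed >= THRESHOLDS_MS.stopContainer) return 'stopContainer';
  if (elapsed >= THRESHOLDS_MS.warn) return 'warn';
  return 'none';
}
```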
  2. New Syncthing API Endpoints
  • getPeerSyncDiagnostics() - Cluster health diagnostics
  • Enhanced status monitoring capabilities
  3. Crontab and Mounts Cleanup Service (crontabAndMountsCleanup.js)
  • Runs 30 seconds after FluxOS startup
  • Migrates old crontab entries to new safe format
  • Removes entries for uninstalled apps
  • Ensures all installed apps have active mounts
  • If crontab update fails, automatically uninstalls the app and notifies peers
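A minimal sketch of the startup wiring for such a service. Only the 30-second delay is stated in the release notes; the helper name and the callback wiring are assumptions:

```js
// Hypothetical wiring: run the cleanup once, shortly after FluxOS boots.
const STARTUP_DELAY_MS = 30 * 1000;

function scheduleCrontabAndMountsCleanup(runCleanup) {
  setTimeout(async () => {
    try {
      // Migrate old entries, drop stale ones, execute any missing mounts.
      await runCleanup();
    } catch (err) {
      console.error('crontabAndMountsCleanup failed:', err);
    }
  }, STARTUP_DELAY_MS);
}
```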

Files Changed

| File | Changes |
| --- | --- |
| syncthingHealthMonitor.js | NEW - Complete health monitoring system (539 lines) |
| crontabAndMountsCleanup.js | NEW - Startup cleanup service (397 lines) |
| syncthingMonitor.js | Added mount safety checks, health monitoring integration |
| syncthingFolderStateMachine.js | Mount safety verification functions |
| syncthingService.js | New API endpoints for diagnostics |
| advancedWorkflows.js | Wait logic for crontab mounts |
| appController.js | Primary node checks for g: apps |
| appSpawner.js | Node reachability validation |
| serviceManager.js | Startup service integration |
| syncthingMonitorConstants.js | Health check thresholds |
| globalState.js | Health cache initialization |
| routes.js | New API routes |

Testing

  • Comprehensive unit tests for health monitor (1239 lines)
  • Manual testing on nodes with encrypted LUKS volumes
  • Verified mount behavior after reboot

Migration Notes

  • Existing crontab entries will be automatically updated on first FluxOS startup after deployment
  • Old entries without wait logic will be migrated to new format
  • Stale entries for removed apps will be cleaned up automatically

Risk Assessment

Low Risk:

  • All changes are additive or defensive
  • Automatic migration handles backward compatibility
  • Health monitor has conservative thresholds before taking action

Benefits:

  • Prevents critical data loss scenarios
  • Improves cluster stability
  • Self-healing capabilities for sync issues
  • Better observability with health monitoring
