RunOnFlux/flux v7.2.0

Summary

This PR introduces comprehensive improvements to the Syncthing global sync (g: flag / master-slave) functionality: safer mounts after reboot, new health monitoring, and stricter primary node election. It also fixes a race condition where loop device mounts failed on system boot because the encrypted volume was not yet available.


Problems Identified and Fixed

  1. Critical: Data Loss Risk on Reboot (Race Condition)

Problem: After system reboot, @reboot cron jobs for mounting loop devices executed before the encrypted LUKS volume (/dat) was mounted, causing silent mount failures. This left app folders empty,
which could trigger Syncthing to propagate empty data to peers.

Evidence from logs:

```
Nov 17 18:03:58 CRON[1572]: (CRON) info (No MTA installed, discarding output)
```

Mounts failed silently because /dat wasn't ready when cron ran.

Solution:

  • Added wait logic to crontab mount commands: `while [ ! -f ]; do sleep 5; done && sudo mount ...` (see the sketch after this list)
  • Created crontabAndMountsCleanup service that runs on startup to:
    • Update old crontab entries to include wait logic
    • Remove stale entries for uninstalled apps
    • Verify and execute missing mounts immediately
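A minimal JavaScript sketch of how such a crontab entry could be assembled. The function name and the marker/mount paths are illustrative assumptions, not the PR's actual code:

```js
// Hypothetical sketch, not the PR's code: build a @reboot crontab entry that
// waits for the encrypted volume before mounting a loop device.
function buildRebootMountEntry(loopFile, mountPoint, markerFile) {
  // Poll until a marker file on the encrypted volume appears, then mount.
  const wait = `while [ ! -f ${markerFile} ]; do sleep 5; done`;
  const mount = `sudo mount -o loop ${loopFile} ${mountPoint}`;
  return `@reboot ${wait} && ${mount}`;
}

// Example (all paths illustrative):
console.log(buildRebootMountEntry(
  '/dat/apps/myapp.img',
  '/dat/apps/myapp',
  '/dat/.flux-volume-ready',
));
```

The key property is that the mount command cannot run until the encrypted volume is demonstrably present, so a reboot can no longer leave an app folder silently empty.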
  2. Syncthing Processing While Folders Are Unmounted

Problem: Syncthing monitor would process apps even when their loop device volumes weren't mounted, potentially causing data inconsistencies.

Solution:

  • Added mount safety verification in syncthingAppsCore() that skips all processing if any app folder exists but isn't properly mounted
  • Health monitor now skips folders that aren't mounted yet
  • Uses verifyFolderMountSafety() to detect empty unmounted directories (likely unmounted loop devices); a sketch follows this list
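A minimal Node.js sketch of what such a check might look like. Only the idea (empty folder that is not a mountpoint means unmounted loop device) comes from the release notes; the function name below and the `mountpoint -q` probe are assumptions:

```js
const fs = require('node:fs/promises');
const { execFile } = require('node:child_process');
const { promisify } = require('node:util');

const execFileAsync = promisify(execFile);

// Assumed logic: a folder is unsafe to sync when it exists, is empty, and is
// not an actual mountpoint (the signature of an unmounted loop device).
async function isFolderMountSafe(folderPath) {
  let entries;
  try {
    entries = await fs.readdir(folderPath);
  } catch {
    return false; // folder missing entirely; skip processing
  }
  if (entries.length > 0) return true; // has content, treat as mounted

  try {
    // `mountpoint -q` exits 0 only when the path is a real mountpoint.
    await execFileAsync('mountpoint', ['-q', folderPath]);
    return true; // empty but genuinely mounted (e.g. a fresh volume)
  } catch {
    return false; // empty and unmounted: do not let Syncthing touch it
  }
}
```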
  3. Master-Slave Apps Not Respecting the Primary Node

Problem: Global sync (g: flag) apps were being started/restarted on non-primary nodes, causing potential conflicts.

Solution: Added shouldStartGlobalSyncApp() function that checks if the current node is the primary before starting/restarting master-slave apps.
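A hedged sketch of that gate; only the function name comes from the PR, and how the primary is elected and how the g: flag is detected are assumptions here:

```js
// Sketch only: election inputs and flag detection are illustrative.
function shouldStartGlobalSyncApp(appSpec, myNodeId, primaryNodeId) {
  const isMasterSlave = typeof appSpec.containerData === 'string'
    && appSpec.containerData.includes('g:');
  if (!isMasterSlave) return true; // ordinary apps are unaffected
  // Master-slave apps only start (or restart) on the elected primary.
  return myNodeId === primaryNodeId;
}
```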

  4. Syncthing Apps Installing on Unreachable Nodes

Problem: Apps could be spawned on nodes with the same subdomain or nodes that weren't reachable, causing sync failures.

Solution: Added validation in appSpawner.js to check node reachability and avoid same-subdomain installations.
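One plausible shape for these checks, assuming a candidate record with an API URL and subdomain, a set of subdomains already in use, and Node 18+ global fetch. None of this is the PR's exact logic:

```js
// Illustrative only: the PR's actual reachability probe and subdomain rule
// are not shown in the release notes.
async function isNodeViableForSpawn(candidate, usedSubdomains) {
  // Never place a second instance behind a subdomain already in use.
  if (usedSubdomains.has(candidate.subdomain)) return false;

  try {
    // Basic liveness probe against the node's API with a 5 s timeout.
    const res = await fetch(candidate.apiUrl, {
      signal: AbortSignal.timeout(5000),
    });
    return res.ok;
  } catch {
    return false; // unreachable: syncing to this node would fail anyway
  }
}
```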


New Features

  1. Syncthing Health Monitor (syncthingHealthMonitor.js)
  • Monitors cluster health and connectivity
  • Detects isolated nodes (no peers connected)
  • Takes corrective actions for prolonged sync issues (escalation sketched after this list):
    • Warning after configurable threshold
    • Stop container after extended issues
    • Restart Syncthing service
    • Remove app locally if unrecoverable
  • Automatically restarts stopped apps when issues resolve
  • Comprehensive diagnostics and logging
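A hedged sketch of the escalation ladder; the threshold values and action names below are illustrative, while the real constants live in syncthingMonitorConstants.js:

```js
// Illustrative thresholds, not the PR's constants.
const THRESHOLDS_MS = {
  warn: 10 * 60 * 1000,             // log a warning after 10 min of issues
  stopContainer: 30 * 60 * 1000,    // stop the app container after 30 min
  restartSyncthing: 60 * 60 * 1000, // restart the Syncthing service after 1 h
  removeApp: 3 * 60 * 60 * 1000,    // last resort: remove the app locally
};

// Pick the most severe action whose threshold the ongoing issue has crossed.
function nextAction(issueStartedAtMs, nowMs = Date.now()) {
  const elapsed = nowMs - issueStartedAtMs;
  if (elapsed >= THRESHOLDS_MS.removeApp) return 'removeAppLocally';
  if (elapsed >= THRESHOLDS_MS.restartSyncthing) return 'restartSyncthing';
  if (elapsed >= THRESHOLDS_MS.stopContainer) return 'stopContainer';
  if (elapsed >= THRESHOLDS_MS.warn) return 'warn';
  return 'none';
}
```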
  2. New Syncthing API Endpoints
  • getPeerSyncDiagnostics() - Cluster health diagnostics
  • Enhanced status monitoring capabilities
  3. Crontab and Mounts Cleanup Service (crontabAndMountsCleanup.js)
  • Runs 30 seconds after FluxOS startup
  • Migrates old crontab entries to new safe format
  • Removes entries for uninstalled apps
  • Ensures all installed apps have active mounts
  • If crontab update fails, automatically uninstalls the app and notifies peers
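A minimal sketch of the startup wiring for such a service. Only the 30-second delay is stated in the release notes; the helper name and the callback wiring are assumptions:

```js
// Hypothetical wiring: run the cleanup once, shortly after FluxOS boots.
const STARTUP_DELAY_MS = 30 * 1000;

function scheduleCrontabAndMountsCleanup(runCleanup) {
  setTimeout(async () => {
    try {
      // Migrate old entries, drop stale ones, execute any missing mounts.
      await runCleanup();
    } catch (err) {
      console.error('crontabAndMountsCleanup failed:', err);
    }
  }, STARTUP_DELAY_MS);
}
```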

Files Changed

| File | Changes |
| --- | --- |
| syncthingHealthMonitor.js | NEW - Complete health monitoring system (539 lines) |
| crontabAndMountsCleanup.js | NEW - Startup cleanup service (397 lines) |
| syncthingMonitor.js | Added mount safety checks, health monitoring integration |
| syncthingFolderStateMachine.js | Mount safety verification functions |
| syncthingService.js | New API endpoints for diagnostics |
| advancedWorkflows.js | Wait logic for crontab mounts |
| appController.js | Primary node checks for g: apps |
| appSpawner.js | Node reachability validation |
| serviceManager.js | Startup service integration |
| syncthingMonitorConstants.js | Health check thresholds |
| globalState.js | Health cache initialization |
| routes.js | New API routes |

Testing

  • Comprehensive unit tests for health monitor (1239 lines)
  • Manual testing on nodes with encrypted LUKS volumes
  • Verified mount behavior after reboot

Migration Notes

  • Existing crontab entries will be automatically updated on first FluxOS startup after deployment
  • Old entries without wait logic will be migrated to new format
  • Stale entries for removed apps will be cleaned up automatically

Risk Assessment

Low Risk:

  • All changes are additive or defensive
  • Automatic migration handles backward compatibility
  • Health monitor has conservative thresholds before taking action

Benefits:

  • Prevents critical data loss scenarios
  • Improves cluster stability
  • Self-healing capabilities for sync issues
  • Better observability with health monitoring
