Summary
This PR introduces comprehensive improvements to the Syncthing global sync (g: flag / master-slave) functionality, addressing critical issues with mount safety after reboot, health monitoring, and
primary node election. It also fixes a race condition where loop device mounts failed on system boot due to encrypted volume unavailability.
Problems Identified and Fixed
- Critical: Data Loss Risk on Reboot (Race Condition)
Problem: After system reboot, @reboot cron jobs for mounting loop devices executed before the encrypted LUKS volume (/dat) was mounted, causing silent mount failures. This left app folders empty,
which could trigger Syncthing to propagate empty data to peers.
Evidence from logs:
Nov 17 18:03:58 CRON[1572]: (CRON) info (No MTA installed, discarding output)
Mounts failed silently because /dat wasn't ready when cron ran.
Solution:
- Added wait logic to crontab mount commands so they poll until the encrypted volume is available before mounting: while [ ! -f <readiness file on /dat> ]; do sleep 5; done && sudo mount ... (see the sketch after this list)
- Created crontabAndMountsCleanup service that runs on startup to:
- Update old crontab entries to include wait logic
- Remove stale entries for uninstalled apps
- Verify and execute missing mounts immediately
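For reviewers, a minimal sketch of how such a wait-and-mount crontab entry could be assembled. This is illustrative only, not the actual advancedWorkflows.js code; the marker path, image path, and mount point are placeholders.

```js
// Illustrative sketch only -- not the actual advancedWorkflows.js code.
// Builds an @reboot crontab entry that polls for a file on the encrypted
// volume before attempting the loop device mount.
function buildRebootMountEntry(readyMarker, imagePath, mountPoint) {
  // Wait until the marker file exists, i.e. until /dat has been mounted.
  const waitLoop = `while [ ! -f ${readyMarker} ]; do sleep 5; done`;
  const mountCmd = `sudo mount -o loop ${imagePath} ${mountPoint}`;
  return `@reboot ${waitLoop} && ${mountCmd}`;
}

// Example (all paths are hypothetical):
console.log(buildRebootMountEntry('/dat/.ready', '/dat/apps/myapp.img', '/dat/apps/myapp'));
```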
- Syncthing Processing While Folders Are Unmounted
Problem: Syncthing monitor would process apps even when their loop device volumes weren't mounted, potentially causing data inconsistencies.
Solution:
- Added mount safety verification in syncthingAppsCore() that skips all processing if any app folder exists but isn't properly mounted
- Health monitor now skips folders that aren't mounted yet
- Uses verifyFolderMountSafety() to detect app folders that exist but are empty and not mounted, which typically indicates an unmounted loop device (sketched below)
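A rough sketch of that mount-safety heuristic. The real verifyFolderMountSafety() in syncthingFolderStateMachine.js may differ; the mountpoint check and the return shape here are assumptions.

```js
// Assumption-laden sketch: a folder that exists, is empty, and is not a mount
// point is treated as a likely unmounted loop device and therefore unsafe to sync.
const fs = require('fs/promises');
const util = require('util');
const { exec } = require('child_process');

const execAsync = util.promisify(exec);

async function verifyFolderMountSafetySketch(folderPath) {
  let entries;
  try {
    entries = await fs.readdir(folderPath);
  } catch (err) {
    return { safe: true, reason: 'folder does not exist yet' };
  }
  if (entries.length > 0) {
    return { safe: true, reason: 'folder has content' };
  }
  try {
    // `mountpoint -q` exits non-zero when the path is not a mount point.
    await execAsync(`mountpoint -q ${folderPath}`);
    return { safe: true, reason: 'empty but mounted' };
  } catch (err) {
    return { safe: false, reason: 'empty and not mounted (likely unmounted loop device)' };
  }
}
```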
- Master-Slave Apps Not Respecting Primary Node
Problem: Global sync (g: flag) apps were being started/restarted on non-primary nodes, causing potential conflicts.
Solution: Added shouldStartGlobalSyncApp() function that checks if the current node is the primary before starting/restarting master-slave apps.
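A hypothetical sketch of that primary check. The actual shouldStartGlobalSyncApp() may use a different election rule; the lowest-IP election below is an assumption made for illustration.

```js
// Sketch only: decide whether this node should start/restart a g: (master-slave) app.
function shouldStartGlobalSyncAppSketch(appSpec, myIp, runningLocations) {
  // Non-g: apps are unaffected by the primary check.
  const usesGlobalSync = (appSpec.containerData || '').includes('g:');
  if (!usesGlobalSync) return true;

  // If no peer locations are known yet, allow the start rather than blocking the app.
  if (!runningLocations || runningLocations.length === 0) return true;

  // Deterministic election for the sketch: the lowest IP among known
  // locations is treated as the primary node.
  const primary = [...runningLocations].sort()[0];
  return primary === myIp;
}

// Example:
// shouldStartGlobalSyncAppSketch({ containerData: 'g:/data' }, '10.0.0.2', ['10.0.0.1', '10.0.0.2']);
// => false (10.0.0.1 would be the primary under this election rule)
```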
- Syncthing Apps Installing on Unreachable Nodes
Problem: Apps could be spawned on nodes sharing a subdomain with an existing instance, or on nodes that weren't reachable, causing sync failures.
Solution: Added validation in appSpawner.js to check node reachability and avoid same-subdomain installations.
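The kind of pre-spawn validation this adds could look roughly like the following. The endpoint path, port, timeout, and the same-host reading of "same subdomain" are assumptions, not the PR's exact logic.

```js
// Sketch of pre-spawn checks; not the actual appSpawner.js implementation.
const axios = require('axios');

// Treat candidates that share a host with an existing instance as "same subdomain"
// (an assumption for this sketch).
function sharesSubdomain(candidateIp, existingLocations) {
  const candidateHost = candidateIp.split(':')[0];
  return existingLocations.some((loc) => loc.split(':')[0] === candidateHost);
}

// Probe the candidate node's API before selecting it; path and port are illustrative.
async function isNodeReachable(ip, port = 16127, timeoutMs = 5000) {
  try {
    const res = await axios.get(`http://${ip}:${port}/flux/uptime`, { timeout: timeoutMs });
    return res.status === 200;
  } catch (err) {
    return false;
  }
}
```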
New Features
- Syncthing Health Monitor (syncthingHealthMonitor.js)
- Monitors cluster health and connectivity
- Detects isolated nodes (no peers connected)
- Takes corrective actions for prolonged sync issues:
- Warning after configurable threshold
- Stop container after extended issues
- Restart Syncthing service
- Remove app locally if unrecoverable
- Automatically restarts stopped apps when issues resolve
- Comprehensive diagnostics and logging
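The escalation behaviour can be pictured as a simple ladder keyed off how long a folder has been unhealthy. The thresholds and action names below are placeholders; the real values live in syncthingMonitorConstants.js.

```js
// Sketch of the corrective-action ladder; thresholds here are examples only.
const ESCALATION_STEPS = [
  { afterMinutes: 15, action: 'warn' },
  { afterMinutes: 60, action: 'stopContainer' },
  { afterMinutes: 120, action: 'restartSyncthing' },
  { afterMinutes: 240, action: 'removeAppLocally' },
];

// Returns the most severe action whose threshold has elapsed, or null if the
// issue is still below the first threshold.
function nextAction(issueStartedAtMs, nowMs = Date.now()) {
  const minutes = (nowMs - issueStartedAtMs) / 60000;
  const due = ESCALATION_STEPS.filter((step) => minutes >= step.afterMinutes);
  return due.length ? due[due.length - 1].action : null;
}
```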
- New Syncthing API Endpoints
- getPeerSyncDiagnostics() - Cluster health diagnostics
- Enhanced status monitoring capabilities
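As an illustration of what a diagnostics endpoint could return, here is an Express-style sketch; the route path and every field in the response are assumptions, not the schema implemented by getPeerSyncDiagnostics().

```js
// Hypothetical route and response shape -- not the PR's actual API.
const express = require('express');

const app = express();

app.get('/apps/peersyncdiagnostics', async (req, res) => {
  const diagnostics = {
    connectedPeers: 3,                     // peers currently connected to local Syncthing
    isolated: false,                       // true when no peers are connected
    foldersWithIssues: [],                 // folder IDs flagged by the health monitor
    lastHealthCheck: new Date().toISOString(),
  };
  res.json({ status: 'success', data: diagnostics });
});

app.listen(3000);
```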
- Crontab and Mounts Cleanup Service (crontabAndMountsCleanup.js)
- Runs 30 seconds after FluxOS startup
- Migrates old crontab entries to new safe format
- Removes entries for uninstalled apps
- Ensures all installed apps have active mounts
- If crontab update fails, automatically uninstalls the app and notifies peers
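A condensed sketch of that startup pass. Helper behaviour, the marker path, and the matching rules are placeholders; the real logic in crontabAndMountsCleanup.js is more involved (it also executes missing mounts and handles the uninstall path).

```js
// Sketch of the crontab migration/cleanup pass; not the actual service code.
const fs = require('fs/promises');
const os = require('os');
const path = require('path');
const util = require('util');
const { exec } = require('child_process');

const execAsync = util.promisify(exec);

async function cleanupCrontabSketch(installedAppNames, readyMarker = '/dat/.ready') {
  // `crontab -l` exits non-zero when the crontab is empty.
  let current = '';
  try {
    ({ stdout: current } = await execAsync('crontab -l'));
  } catch (err) {
    return;
  }

  const kept = [];
  for (const line of current.split('\n').filter(Boolean)) {
    if (!line.includes('sudo mount')) {
      kept.push(line); // unrelated entry, keep untouched
      continue;
    }
    // Drop stale mount entries that no longer belong to an installed app.
    if (!installedAppNames.some((name) => line.includes(name))) continue;
    // Migrate old-format entries that lack the wait loop.
    if (!line.includes('while [ ! -f')) {
      kept.push(line.replace('@reboot', `@reboot while [ ! -f ${readyMarker} ]; do sleep 5; done &&`));
    } else {
      kept.push(line);
    }
  }

  // Write the cleaned crontab back via a temporary file.
  const tmpFile = path.join(os.tmpdir(), `crontab-${Date.now()}.txt`);
  await fs.writeFile(tmpFile, `${kept.join('\n')}\n`);
  await execAsync(`crontab ${tmpFile}`);
  await fs.unlink(tmpFile);
}
```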
Files Changed
| File | Changes |
|---|---|
| syncthingHealthMonitor.js | NEW - Complete health monitoring system (539 lines) |
| crontabAndMountsCleanup.js | NEW - Startup cleanup service (397 lines) |
| syncthingMonitor.js | Added mount safety checks, health monitoring integration |
| syncthingFolderStateMachine.js | Mount safety verification functions |
| syncthingService.js | New API endpoints for diagnostics |
| advancedWorkflows.js | Wait logic for crontab mounts |
| appController.js | Primary node checks for g: apps |
| appSpawner.js | Node reachability validation |
| serviceManager.js | Startup service integration |
| syncthingMonitorConstants.js | Health check thresholds |
| globalState.js | Health cache initialization |
| routes.js | New API routes |
Testing
- Comprehensive unit tests for health monitor (1239 lines)
- Manual testing on nodes with encrypted LUKS volumes
- Verified mount behavior after reboot
Migration Notes
- Existing crontab entries will be automatically updated on first FluxOS startup after deployment
- Old entries without wait logic will be migrated to new format
- Stale entries for removed apps will be cleaned up automatically
Risk Assessment
Low Risk:
- All changes are additive or defensive
- Automatic migration handles backward compatibility
- Health monitor has conservative thresholds before taking action
Benefits:
- Prevents critical data loss scenarios
- Improves cluster stability
- Self-healing capabilities for sync issues
- Better observability with health monitoring