ProxMenux v1.2.2

Stable consolidation of the v1.2.1.x beta cycle. Four prereleases of feature work and fixes land together on the stable channel: a much more configurable Health Monitor (per-category thresholds, per-event dismiss durations, an audit log of active suppressions), a notification stack that reaches ~80 services through Apprise and persists events across Quiet Hours, automatic update detection for LXC containers, and a long list of operator-visible fixes — HTTPS terminal handshakes, kernel update detection on PVE 9.x, NVIDIA installer flow on Alpine, and a quieter Monitor process on idle hosts.

✨ What's new

Health Monitor — more configurable, more granular

Four pieces that together let an operator dial the Health Monitor to their environment instead of working around its defaults.

Per-category thresholds. Every check the Health Monitor runs is parameterised by a pair of numbers, a Warning threshold and a Critical threshold, and both are now exposed in Settings → Health Monitor Thresholds. A homelab with a single-disk SSD may want to page earlier on capacity (75 / 90 %); a datacentre host with redundant Ceph nodes can be more relaxed on memory warnings (90 % is normal under ZFS ARC); and a passively-cooled mini-PC needs lower temperature thresholds than a server with forced-air cooling. The same numbers also feed the colour ranges of the dashboard widgets (the temperature line in the disk-temperature modal, the bars on the storage cards, the chips on the CPU / memory panels), so the visual classification always matches what actually triggered the alert.
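As a rough illustration of how a Warning / Critical pair can classify a reading, here is a minimal Python sketch. The function name, the "at or above trips the threshold" comparison and the example values are assumptions for illustration, not ProxMenux's actual implementation.

```python
def classify(value: float, warning: float, critical: float) -> str:
    """Map a reading to a severity using a Warning / Critical threshold pair.

    Assumption: a reading at or above a threshold trips that threshold.
    """
    if warning > critical:
        raise ValueError("warning threshold must not exceed critical threshold")
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"


# Example: the 75 / 90 % capacity pair mentioned above, disk at 80 % full.
print(classify(80.0, warning=75.0, critical=90.0))  # WARNING
```

Using the same function for both the alert and the widget colour is one way to keep the two in agreement, which is the property the notes describe.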
Per-event dismiss duration. The Dismiss button on every Health Monitor alert now opens a small dropdown with three options: 24 hours, 7 days or Permanently. The 24h / 7d paths behave like the previous time-limited dismiss; Permanently persists the alert with suppression_hours = -1, never re-emits, and is marked with a distinct amber Permanent badge so the operator can always see which alerts are intentionally silenced. POST /api/health/acknowledge accepts an optional suppression_hours body field for this; omitting it preserves the previous behaviour (the category's configured default applies).
Active Suppressions panel. A new section inside Settings → Health Monitor, right below the per-category suppression durations, lists every currently-silenced alert — both time-limited dismisses (with a remaining-time badge like 22h remaining) and permanent ones. Each row carries the error_key, category, severity, the timestamp the alert was dismissed, plus a Re-enable button that clears the acknowledgment so the alert can fire again on the next scan. The Re-enable action is gated by the Health Monitor Edit mode at the top of the card and is committed alongside any per-category changes on Save. Permanent dismisses can only be reverted from here — the dashboard intentionally does not expose a per-alert un-dismiss affordance.
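The dismiss semantics described above (a time-limited window, or suppression_hours = -1 for a permanent dismiss) can be sketched as follows. This is an illustrative sketch only: the function name, the rounding of the remaining time and the label strings are assumptions, and only the -1 sentinel comes from the notes.

```python
from datetime import datetime, timedelta
from typing import Optional

PERMANENT = -1  # suppression_hours value the notes give for a permanent dismiss


def suppression_badge(dismissed_at: datetime, suppression_hours: int,
                      now: datetime) -> Optional[str]:
    """Return a badge label for a dismissed alert, or None once it has expired."""
    if suppression_hours == PERMANENT:
        return "Permanent"
    expires_at = dismissed_at + timedelta(hours=suppression_hours)
    if now >= expires_at:
        return None  # window is over: the alert may fire again on the next scan
    # Assumption: the badge rounds the remaining time down to whole hours.
    hours_left = int((expires_at - now).total_seconds() // 3600)
    return f"{hours_left}h remaining"


# A 24 h dismiss made at 10:00, viewed at 11:30 the same day.
dismissed = datetime(2025, 1, 1, 10, 0)
print(suppression_badge(dismissed, 24, datetime(2025, 1, 1, 11, 30)))  # 22h remaining
```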
Disk I/O severity tiers. A sliding 24 h window classifies dmesg ATA / SCSI errors into silent (0–10), WARNING (11–100) and CRITICAL (more than 100, or any hard error such as UNC / Buffer I/O / Sense Key Hardware Error), so quiet days stay quiet and a single Buffer I/O event still pages immediately.
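The tiering can be pictured with a small Python sketch. It assumes the error count has already been collected for the last 24 h and that hard errors are detected separately; the function name and the reading of the upper tier as "more than 100" are assumptions.

```python
def disk_io_severity(error_count: int, hard_error: bool = False) -> str:
    """Classify the number of ATA / SCSI errors seen in the last 24 h.

    Assumption: the CRITICAL tier starts above 100, since 11-100 is WARNING.
    """
    if error_count < 0:
        raise ValueError("error_count must be non-negative")
    if hard_error or error_count > 100:
        return "CRITICAL"  # any hard error escalates regardless of the count
    if error_count > 10:
        return "WARNING"
    return "SILENT"


print(disk_io_severity(3))                    # SILENT
print(disk_io_severity(1, hard_error=True))   # CRITICAL
```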

Notifications — Apprise, Quiet Hours buffering, AI rework

Apprise notification channel. One Apprise URL talks to ~80 services (Pushover, ntfy, Slack, Matrix, mailto, Signal, Pushbullet, Mattermost, ...) without ProxMenux needing a dedicated adapter for each. The Apprise tab now exposes full feature parity with the native channels: the same Notification Categories block, per-event sub-toggles, Quiet Hours and Daily Digest controls as Telegram, Gotify, Discord and Email. The backend already supported per-channel filtering for Apprise via the generic channel_overrides block; the UI just wasn't surfacing it.
Quiet Hours buffering. Events suppressed during a channel's quiet window are now persisted to SQLite and released as a grouped summary when the window closes, instead of being silently dropped.
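A minimal sketch of the buffer-then-release idea, using Python's standard sqlite3 module. The table name, columns, function names and summary wording are all hypothetical; the notes only state that suppressed events are persisted to SQLite and released as a grouped summary.

```python
import sqlite3
from collections import Counter
from typing import Optional


def open_buffer(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the buffer; ':memory:' keeps this demo off disk."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS quiet_buffer ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " channel TEXT NOT NULL,"
        " category TEXT NOT NULL,"
        " message TEXT NOT NULL)"
    )
    return conn


def buffer_event(conn: sqlite3.Connection, channel: str,
                 category: str, message: str) -> None:
    """Persist an event that arrived inside the channel's quiet window."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO quiet_buffer (channel, category, message) VALUES (?, ?, ?)",
            (channel, category, message),
        )


def release_summary(conn: sqlite3.Connection, channel: str) -> Optional[str]:
    """Drain the channel's buffered events and return one grouped summary."""
    with conn:
        rows = conn.execute(
            "SELECT category FROM quiet_buffer WHERE channel = ? ORDER BY id",
            (channel,),
        ).fetchall()
        conn.execute("DELETE FROM quiet_buffer WHERE channel = ?", (channel,))
    if not rows:
        return None
    counts = Counter(category for (category,) in rows)
    parts = [f"{n} x {cat}" for cat, n in sorted(counts.items())]
    return f"{len(rows)} event(s) during Quiet Hours: " + ", ".join(parts)
```

Calling release_summary when the quiet window closes yields a single message per channel, and a second call returns None because the buffer has been drained.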
AI Enhancement, redesigned. The AI Enhancement subsection in Notifications was rewritten from a muted uppercase row that testers consistently scrolled past, to a normal-case foreground label with a leading Sparkles icon and a persistent badge (green Active when AI is enabled, neutral Optional when it isn't) so the feature is visible regardless of state.

Container updates and tooling

LXC update detection. A new dedicated section in Settings (between Health Monitor Thresholds and Notifications) adds a single toggle that gates the per-CT apt list --upgradable / apk list -u scan end-to-end. Default ON. When OFF the scan stops entirely (no pct exec calls), every type=lxc entry is purged from the managed-installs registry immediately, and the matching notification toggle in Notifications → Services disappears from the UI while preserving its stored preference. The checker also reads the mtime of the CT's package-manager metadata cache and refreshes it via pct exec if it's older than 24 h. On lab hardware, a Debian 12 CT with a 524-day-old cache went from "0 updates" to "117 (12 security)".
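The staleness check on the metadata cache can be sketched in a few lines of Python. How ProxMenux actually reaches the file inside the container is not described in these notes, so the sketch simply takes a path; the function names and the choice to treat a missing cache as stale are assumptions.

```python
import os
import time
from typing import Optional

MAX_CACHE_AGE_HOURS = 24  # refresh limit stated in the notes


def is_older_than(mtime: float, now: float,
                  max_age_hours: float = MAX_CACHE_AGE_HOURS) -> bool:
    """True when the timestamp is more than max_age_hours in the past."""
    return (now - mtime) > max_age_hours * 3600


def cache_is_stale(cache_path: str, now: Optional[float] = None) -> bool:
    """True when the package-manager cache should be refreshed before scanning."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(cache_path)
    except OSError:
        return True  # assumption: a missing or unreadable cache counts as stale
    return is_older_than(mtime, now)
```

With a check like this in front of the scan, a cache that is hundreds of days old triggers a refresh first, so the upgradable list reflects current repository metadata.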
Post-install function update detection. The Monitor tracks installed ProxMenux optimizations (Log2Ram, Memory Settings, System Limits, Logrotate, ...) and notifies when a newer version of any of them is available, with one-click apply from Settings.

Hardware support

NVIDIA driver update notifications. Kernel-aware detection of newer compatible driver versions, surfaced in the Hardware tab and as notifications when an upstream build is published.
Coral TPU installer. Uninstall path mirroring the NVIDIA flow, and registry-driven update notifications for both the PCIe gasket-dkms driver (tracked against feranick/gasket-driver) and the USB libedgetpu1 runtime (tracked via apt).
Secure Gateway (Tailscale) updates. One-click Tailscale update from Settings with Last-checked / Installed / Latest indicators and notification when a new version is available.

Other improvements

Helper-Scripts menu — richer context. Each entry now ships more useful information so it's easier to know what every script does before running it.
Disk temperature monitoring — improved readings, smarter caching across SMART probes, redesigned history modal opening at 24 h by default with min / avg / max statistics.
VM and LXC modal — expanded so a single panel covers the data you previously had to look up across multiple tabs.
Page load — faster first paint and lighter network usage on the Overview, Storage and Hardware tabs.
Security tightening — tighter authentication checks across notification, scripts and terminal endpoints, plus a more conservative default policy for new installs.

🛠️ Notable fixes

Terminal modals on HTTPS hosts — every terminal modal (dashboard terminal, LXC terminal, script terminal) used to fail with WebSocket connection error on hosts with HTTPS enabled. Root cause: the gevent + SSL path stacked geventwebsocket's WebSocketHandler on top of flask-sock's protocol implementation, so the server emitted two consecutive HTTP/1.1 101 Switching Protocols headers and the browser closed the connection as a corrupt frame. Dropping handler_class=WebSocketHandler restores a single 101 response and the handshake completes normally.
Health Monitor kernel updates on PVE 9.x (#208) — the System Updates → Kernel / PVE row reported "Kernel/PVE up to date" on PVE 9.x hosts even when an update for the running kernel was waiting upstream. Three combined fixes: (a) the kernel-package prefix list now includes proxmox-kernel-* and proxmox-firmware-* (PVE 9.x ships kernels under proxmox-kernel-, not pve-kernel- as in 7.x / 8.x), (b) the dry-run switched from apt-get upgrade --dry-run to apt-get dist-upgrade --dry-run so kernel updates packaged as new installs are visible at all, (c) the categoriser reads uname -r and flags an update as a running-kernel update when the package matches the running release. The row now distinguishes "Running kernel update available (reboot required)" from "N kernel update(s) available (none for running kernel)".
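Parts (a) and (c) of that fix can be sketched in Python as below. This is a simplified illustration: it looks at kernel packages only, takes the upgradable package names as a plain list, and assumes "matches the running release" means the release string appears in the package name. The function name and sample package names are hypothetical; the three result strings are the ones quoted above.

```python
from typing import Iterable

# Prefixes named in the notes: pve-kernel- (7.x / 8.x) plus the PVE 9.x additions.
KERNEL_PREFIXES = ("pve-kernel-", "proxmox-kernel-", "proxmox-firmware-")


def kernel_update_status(upgradable: Iterable[str], running_release: str) -> str:
    """Summarise pending kernel packages relative to the running kernel.

    running_release is what `uname -r` prints (platform.release() in Python).
    """
    kernel_pkgs = [p for p in upgradable if p.startswith(KERNEL_PREFIXES)]
    if not kernel_pkgs:
        return "Kernel/PVE up to date"
    # Assumption: a package targets the running kernel if its name contains
    # the running release string.
    if running_release and any(running_release in p for p in kernel_pkgs):
        return "Running kernel update available (reboot required)"
    return f"{len(kernel_pkgs)} kernel update(s) available (none for running kernel)"


print(kernel_update_status(
    ["proxmox-kernel-6.14.8-2-pve-signed", "curl"], "6.14.8-2-pve"))
# Running kernel update available (reboot required)
```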
NVIDIA installer kernel compatibility, Alpine LXC and NVENC — the version menu now respects the running kernel's compatibility window, only offering driver branches that won't fail to compile. Container-side userspace install reworked so it succeeds in Alpine containers, and free-space detection works reliably across all storage layouts. When the host has the NVENC patch applied, the version menu narrows to drivers supported by the patch so reinstalling never silently loses it.
Apprise integration hardening — three independent fixes:
- Mobile overflow on narrow viewports in the Apprise URL row (placeholder reduced to a single concise example, input wrapper enforces min-w-0 / flex-1 / shrink-0, examples paragraph uses break-all min-w-0).
- Backend whitelist rejecting Apprise with HTTP 400. The notifications-test validator's hard-coded channel set ({telegram, gotify, discord, email, all}) was missing apprise, so every Apprise test or send returned 400 Invalid channel before the library was even invoked. The whitelist is now derived live from notification_channels.CHANNEL_TYPES so adding a new channel cannot silently regress this validator again.
- Apprise error reporting. When a destination (jsons://, ntfy://, slack://, ...) returns a non-2xx response, the channel now captures Apprise's internal logger during notify() and surfaces the real HTTP status plus the destination's response body (capped at 300 chars) instead of the opaque "Apprise rejected the notification (transport failure)" message.
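The whitelist fix in the list above comes down to deriving the accepted names from a single registry instead of repeating them. A minimal sketch, assuming CHANNEL_TYPES is a mapping keyed by channel name (its real shape in notification_channels is not shown in these notes); the validator name and return shape are hypothetical.

```python
from typing import Dict, Set, Tuple

# Stand-in for notification_channels.CHANNEL_TYPES (assumed to be keyed by name).
CHANNEL_TYPES: Dict[str, object] = {
    "telegram": object(),
    "gotify": object(),
    "discord": object(),
    "email": object(),
    "apprise": object(),
}


def valid_channels() -> Set[str]:
    """Build the accepted set on every call so new channels are picked up."""
    return set(CHANNEL_TYPES) | {"all"}


def validate_channel(name: str) -> Tuple[int, str]:
    """Return an (HTTP status, message) pair for a requested channel name."""
    if name not in valid_channels():
        return 400, "Invalid channel"
    return 200, "ok"


print(validate_channel("apprise"))  # (200, 'ok')
print(validate_channel("bogus"))    # (400, 'Invalid channel')
```

Because the set is computed from the registry, registering a new channel type automatically makes it pass validation.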
fail2ban-client subprocess storm — the cache wrapper around _f2b_get_banned_ips() only updated its timestamp on success, so on hosts where fail2ban-client returned ENOENT (binary not installed) the function fell through the cache check on every single HTTP request and fired 250+ failed execve calls in a 10-minute window. shutil.which('fail2ban-client') is now resolved once at module load and the cache timestamp is updated unconditionally.
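The shape of that fix, sketched in Python: resolve the binary once, and stamp the cache whether or not the lookup succeeded. The TTL value, cache layout and function name are hypothetical, and the actual fail2ban query is passed in as a callable rather than reproduced here.

```python
import shutil
import time
from typing import Callable, List, Optional

# Resolved once at import time; None when fail2ban-client is not installed.
# A real fetch function would return [] immediately when this is None.
F2B_BIN: Optional[str] = shutil.which("fail2ban-client")

CACHE_TTL_SECONDS = 30.0  # hypothetical value
_cache = {"ts": None, "value": []}


def get_banned_ips_cached(fetch: Callable[[], List[str]],
                          now: Optional[float] = None) -> List[str]:
    """Return banned IPs, calling fetch() at most once per TTL window."""
    now = time.monotonic() if now is None else now
    ts = _cache["ts"]
    if ts is not None and now - ts < CACHE_TTL_SECONDS:
        return _cache["value"]
    try:
        _cache["value"] = fetch()
    except Exception:
        _cache["value"] = []  # remember the failure as an empty result
    finally:
        _cache["ts"] = now  # stamped on success AND failure, so no retry storm
    return _cache["value"]
```

With the timestamp updated only on success, a failing fetch would be retried on every call; stamping it in the finally block caps the retries at one per TTL window.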
smartctl scheduler collision — disk SMART temperature polling, CPU temperature read and latency probe used to fire at the same offset within each minute, producing a measurable CPU / IO spike when all subprocesses spawned together. The polls are now staggered (latency, then CPU temperature, then disk SMART) while preserving the per-disk 60 s cadence.
LXC inventory subprocess — the mount monitor used to call lxc-info -n <vmid> -p for every running CT just to get its PID. It now reads /proc/<lxc-start-pid>/task/<lxc-start-pid>/children directly and falls back to lxc-info only when /proc reads fail, eliminating one subprocess per CT per scan cycle.
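Reading the children file needs no subprocess at all. A minimal Python sketch, assuming the caller already knows the lxc-start PID and handles the lxc-info fallback itself; the function names are hypothetical. The file holds a space-separated list of child PIDs and is only present on kernels built with CONFIG_PROC_CHILDREN.

```python
from pathlib import Path
from typing import List


def parse_children(text: str) -> List[int]:
    """Parse the space-separated PID list found in a 'children' file."""
    return [int(token) for token in text.split()]


def read_children(pid: int, proc_root: str = "/proc") -> List[int]:
    """Return the child PIDs of pid's main thread, or [] if /proc is unreadable.

    An empty list is the caller's cue to fall back to lxc-info.
    """
    path = Path(proc_root) / str(pid) / "task" / str(pid) / "children"
    try:
        return parse_children(path.read_text())
    except (OSError, ValueError):
        return []
```

The proc_root parameter exists only so the function can be pointed at a fixture directory in tests.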
Browser-translated terminal pages — the terminal panel used to lose its WebSocket connection when the user enabled the browser's auto-translate feature, because the translator moved DOM nodes that React still held refs to. Added translate="no" on the terminal container divs so the translator skips the embedded tty entirely.
Active Suppressions UX — re-enables are now queued (green border + strike-through on the row + button label changes to Undo) and applied atomically when the user clicks Save, alongside any per-category dropdown changes. The list also refreshes automatically when an alert is dismissed from the dashboard while the Settings page is already open, via a health-suppression-changed browser event plus listeners on window focus and document visibilitychange.
Minor stability — ATA disk errors are now recorded in disk_observations before the SMART gate (transient errors that don't yet trip SMART still build the per-disk history); the Quiet Hours toggle persists correctly after a refresh; the Login screen no longer swallows a 401 forever after a brief stale-token state; PVE webhook URLs follow the active SSL state automatically; log2ram restarts after a configured size change.

⬆️ Upgrading from v1.2.1

ProxMenux notifies stable users automatically on the next menu launch. The Monitor service restarts in-place — no host reboot is needed for the upgrade itself. If you were running a 1.2.1.x beta, the same menu flow detects that you are now on the published stable channel and offers to switch you off the beta installer.

If you customised any Health Monitor settings before upgrading, they are preserved verbatim — the new Health Monitor Thresholds panel adds new defaults but does not overwrite existing values. The per-category suppression durations you had configured continue to apply as the default when a per-event Dismiss is fired without an explicit window choice.

🙏 Acknowledgments

This release would not look the way it does without the contributions and feedback from the community. Special thanks to:

Code contributors

@jcastro landed five direct improvements that ship with v1.2.2:

Select VM ISOs from all ISO storages — new shared helper scripts/global/iso_storage_helpers.sh plus integration in vm_creator.sh, select_linux_iso.sh and select_windows_iso.sh, so the ISO picker now reads from every storage tagged as ISO content instead of being pinned to local. Commit 092b548d.
Release channel switcher in Settings — a proper menu under scripts/menus/config_menu.sh to flip between the stable and beta install channels in-place, with the right version.txt / beta_version.txt handling on each side. Commit f8a8c43d.
ZFS autotrim in the auto post-install — auto_post_install.sh now enables autotrim=on on root ZFS pools by default (with the matching disable in the uninstall path), so SSD-backed installs reclaim freed space without manual intervention. Commit 8877f987.
Webhook loopback detection + update handoff — flask_notification_routes.py correctly classifies 127.0.0.1 / localhost webhooks as loopback, and the menu script's update handoff no longer flakes on edge cases. Commit 70ab072c.
Figurine bumped to 2.0.0 — banner tool refresh in customizable_post_install.sh, with the doc page updated to match. Commit aba94028.

@pespinel fixed a beta-installer regression that broke service paths after the move to the new runtime layout — install_proxmenux_beta.sh now resolves the right systemd unit paths on first install and on update. Commit 0daab74a.

Field reports that shaped the GPU & Coral work

@ghosthvj's detailed reports and suggestions on the hardware passthrough flow drove the round of improvements that ship in v1.2.2 for the three GPU scripts:

scripts/gpu_tpu/nvidia_installer.sh — kernel-aware version menu, Alpine LXC userspace support, NVENC-patch awareness, uninstall feedback, free-space detection fixes
scripts/gpu_tpu/switch_gpu_mode.sh — orphan audio cascade on detach, precise hostpci regex, vfio.conf cascade extension (the full GPU + audio companion lifecycle hardening described in the GPU + Audio Passthrough section above)
scripts/gpu_tpu/add_gpu_vm.sh — iGPU audio-companion checklist on attach, two-pass scan that protects the HDMI audio of other dGPUs left in the VM

Coral TPU on LXC — latest upstream drivers

The Coral installer for LXC (scripts/gpu_tpu/install_coral_lxc.sh) was rewritten end-to-end to install the latest upstream gasket-dkms driver and libedgetpu1 runtime (220 lines added, 150 removed). Coral M.2 / mPCIe modules that previously failed on PVE 9 kernels now install and bind cleanly, and the registry-driven update notifications introduced in v1.2.1.2 keep both packages fresh going forward.

Everyone else

A huge thank you to every user who opened an issue, commented in GitHub Discussions, reported a bug on the community channel, or just stopped by to say what worked and what didn't on their hardware. Most of the internal improvements in this release — the smartctl scheduler stagger, the fail2ban cache fix, the lxc-info /proc replacement, the HTTPS terminal handshake, the kernel-update detection on PVE 9.x, the Apprise wiring — started as a report from somebody running into the issue. Keep them coming.

MacRimi/ProxMenux v1.2.2 on GitHub