ProxMenux v1.2.1.2 (Beta)
Second beta of the v1.2.1.x cycle. This release closes a series of
real-world issues surfaced after v1.2.1.1 shipped: a server outage
caused by a half-applied Log2Ram resize, ATA disk errors that escaped
the observation log, notifications that doubled up after a burst, and
known-error classifier matches that misread NVIDIA kernel messages as
SATA cable issues. It also extends the managed-installs registry
to the Coral TPU host driver — the gasket-dkms PCIe path and the
libedgetpu USB runtime are now both auto-detected and tracked for
upstream updates, and the installer gains a full uninstall flow
mirroring the NVIDIA one. The disk I/O severity model is replaced by
a sliding 24h window with proper warning / critical tiers so quiet
days stay quiet but a single hard error still pages immediately. The
Quiet Hours pipeline now buffers suppressed events to SQLite and
flushes them at the end of the window instead of silently dropping
them.
Main changes in v1.2.1.2
Coral TPU host driver — uninstall + update tracking
The Coral TPU installer (gpu_tpu/install_coral.sh) gains the same
two-action UX as the NVIDIA installer: on a host that already has
Coral installed (PCIe gasket-dkms, USB libedgetpu1-std/-max, or both)
the script now shows a menu offering Reinstall / update or
Uninstall. The uninstall path unloads the apex/gasket modules,
removes the DKMS registrations for every gasket version, purges the
gasket-dkms / libedgetpu1-std / libedgetpu1-max packages, cleans
up the udev rules, removes the apex system group when nobody else
is using it, and clears the Google Coral apt repo. It is idempotent:
missing pieces are no-ops, never errors.
In parallel, the Coral driver is now a first-class entry in the
managed-installs registry. The detector enumerates both variants
(PCIe → installed gasket-dkms version or DKMS-registered build,
USB → libedgetpu1-std/-max apt version) and the checker queries
the right upstream for each: feranick/gasket-driver tags on GitHub
for PCIe and apt-cache policy for the USB runtime. When a newer
version is available, the same notification pipeline that already
powers the NVIDIA-driver update message fires a
coral_driver_update_available event — one per variant, so a host
with both M.2 and USB Coral devices gets independent update streams.
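As a rough illustration of the per-variant check, the sketch below compares installed and upstream versions and emits one event per variant. It is not the actual checker: the function names, the dict shapes, the payload fields and the sample version strings are assumptions. Only the event name and the two variants come from these notes. Upstream versions are passed in as plain strings, so the sketch does not depend on how the GitHub tags or the apt-cache policy output are fetched.

```python
import re

EVENT = "coral_driver_update_available"  # event name from the release notes


def version_key(version: str) -> tuple:
    """Reduce a version string such as '1.0-18' to a tuple of ints, (1, 0, 18)."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def coral_update_events(installed: dict, upstream: dict) -> list:
    """Return one event per variant whose upstream version is newer.

    Both arguments map a variant name ('pcie', 'usb') to a version string.
    The payload fields are hypothetical.
    """
    events = []
    for variant, current in installed.items():
        latest = upstream.get(variant)
        if latest and version_key(latest) > version_key(current):
            events.append({
                "event": EVENT,
                "variant": variant,
                "installed": current,
                "latest": latest,
            })
    return events


# Illustrative values only: the PCIe variant has an update, the USB one does not.
print(coral_update_events(
    {"pcie": "1.0-18", "usb": "16.0"},
    {"pcie": "1.0-18.2", "usb": "16.0"},
))
```

Comparing numeric tuples is a simplification. Debian version strings have richer ordering rules, which `dpkg --compare-versions` handles properly for the apt-tracked USB runtime.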
Disk I/O severity tiers
The disk_io detector no longer treats every dmesg ATA/SCSI error the
same. It now keeps a 24h sliding window of error timestamps per
device and decides severity from the combination of error type,
rate-per-window and the SMART health verdict:
- silent: 0–10 errors/24h, SMART PASSED, no hard error pattern.
The observation is recorded for the disk's history but no
notification fires.
- WARNING: 11–100 errors/24h on the same device.
- CRITICAL: 100+ errors/24h, or SMART FAILED, or any hard
error like `Buffer I/O error`, `UNC`, `Medium Error`,
`Unrecovered read error`, or `Sense Key Hardware Error`. A single
hard error always pages immediately, regardless of rate.
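The tiers above can be sketched as a sliding window plus a small classifier. This is a minimal sketch, not the detector's real code: the class and function names, the exact regex, and the handling of the boundary at exactly 100 errors (treated here as WARNING) are assumptions.

```python
import re
from collections import defaultdict, deque

WINDOW_SECONDS = 24 * 3600

# Hard-error patterns named in the notes. The exact regex is an assumption.
HARD_ERROR = re.compile(
    r"Buffer I/O error|\bUNC\b|Medium Error|Unrecovered read error"
    r"|Sense Key\s*:?\s*Hardware Error"
)


class ErrorWindow:
    """Per-device sliding window of error timestamps."""

    def __init__(self, window_seconds: int = WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self._stamps = defaultdict(deque)  # device -> timestamps, oldest first

    def record(self, device: str, now: float) -> int:
        """Store one error and return the count still inside the window."""
        stamps = self._stamps[device]
        stamps.append(now)
        cutoff = now - self.window_seconds
        while stamps and stamps[0] <= cutoff:
            stamps.popleft()
        return len(stamps)


def classify(count_24h: int, smart_passed: bool, message: str) -> str:
    """Map rate, SMART verdict and error type to a severity tier."""
    if not smart_passed or HARD_ERROR.search(message) or count_24h > 100:
        return "CRITICAL"
    if count_24h >= 11:
        return "WARNING"
    return "silent"
```

With this shape, a single `Buffer I/O error` line classifies as CRITICAL even when it is the only error in the window, while eleven soft errors on a healthy disk classify as WARNING.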
The 24h same-disk cooldown is keyed by (disk, severity_tier), so a
warning → critical escalation produces a second notification even
within the cooldown window. The pre-existing return-before-record
bug is also fixed: observations are now written to
disk_observations before the SMART gate, so transient errors
that don't yet trip SMART still build the per-disk history.
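Keying the cooldown by (disk, severity_tier) might look like the sketch below. The class name and the in-memory dict are illustrative assumptions; the real implementation presumably persists this state.

```python
COOLDOWN_SECONDS = 24 * 3600


class TierCooldown:
    """Suppress repeat notifications per (disk, severity_tier) pair."""

    def __init__(self, cooldown_seconds: int = COOLDOWN_SECONDS):
        self.cooldown_seconds = cooldown_seconds
        self._last_sent = {}  # (disk, tier) -> timestamp of last notification

    def should_notify(self, disk: str, tier: str, now: float) -> bool:
        key = (disk, tier)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # same disk and same tier, still cooling down
        self._last_sent[key] = now
        return True


cooldown = TierCooldown()
print(cooldown.should_notify("sda", "WARNING", now=0))      # True
print(cooldown.should_notify("sda", "WARNING", now=3600))   # False, same tier
print(cooldown.should_notify("sda", "CRITICAL", now=7200))  # True, escalation
```

Because the tier is part of the key, an escalation never collides with the earlier WARNING entry, which is what lets the second notification through.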
Quiet Hours — events are no longer dropped
Previously, non-CRITICAL events fired during a channel's Quiet Hours
window were silently dropped. They are now buffered to SQLite (a new
quiet_pending table mirroring the existing digest_pending) and
released as a single grouped summary at the moment the window
closes. CRITICAL still bypasses Quiet Hours and is delivered live.
The channel's quiet_* and digest_* fields are also now returned
by get_settings() so the toggle state correctly reloads on a page
refresh.
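A minimal sketch of the buffer-then-flush behaviour, using Python's sqlite3 module. Only the table name quiet_pending comes from these notes. The column layout, the function names and the summary wording are assumptions made for illustration.

```python
import sqlite3
from collections import Counter


def init_db(conn: sqlite3.Connection) -> None:
    # Hypothetical schema; the real quiet_pending columns may differ.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS quiet_pending (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               channel TEXT NOT NULL,
               event_type TEXT NOT NULL,
               severity TEXT NOT NULL,
               body TEXT NOT NULL)"""
    )


def route_event(conn, channel, event_type, severity, body, in_quiet_hours):
    """Return 'deliver' for live delivery, or buffer the event and return 'buffered'."""
    if severity == "CRITICAL" or not in_quiet_hours:
        return "deliver"  # CRITICAL bypasses Quiet Hours
    conn.execute(
        "INSERT INTO quiet_pending (channel, event_type, severity, body) "
        "VALUES (?, ?, ?, ?)",
        (channel, event_type, severity, body),
    )
    conn.commit()
    return "buffered"


def flush_quiet(conn, channel):
    """At window close: build one grouped summary and clear the buffer."""
    rows = conn.execute(
        "SELECT event_type FROM quiet_pending WHERE channel = ? ORDER BY id",
        (channel,),
    ).fetchall()
    if not rows:
        return None
    conn.execute("DELETE FROM quiet_pending WHERE channel = ?", (channel,))
    conn.commit()
    counts = Counter(row[0] for row in rows)
    parts = [f"{n} x {etype}" for etype, n in sorted(counts.items())]
    return f"{len(rows)} events during Quiet Hours: " + ", ".join(parts)
```

Persisting to SQLite rather than holding events in memory means buffered events survive a Monitor restart that happens inside the quiet window.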
log2ram — apply path actually applies
The post-install auto/update flow used to write the new tmpfs size to
/etc/log2ram.conf but didn't restart the log2ram daemon, so a
configured 512M stayed at the original 128M until the operator
restarted manually. The .1.10 server crashed during a backup because
of exactly this. The flow now reloads systemd, restarts log2ram and
rsyslog, and re-runs log2ram clean / write, so the new size
takes effect on the running tmpfs immediately.
VM/CT control — real errors instead of bare 500s
Failed pvesh start/stop/restart operations now surface the actual
pvesh stderr (e.g. "no space left on device", "Configuration file 'nodes/.../qemu-server/100.conf' does not exist") instead of a
bare 500 INTERNAL SERVER ERROR. The backend also fires a
vm_fail / ct_fail notification on every failed control action so
the operator sees it on Telegram even if they were not looking at
the dashboard at that moment.
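The general idea of surfacing the command's own stderr can be sketched as follows. The helper name `run_control` and its return shape are assumptions, and a small Python one-liner stands in for a failing pvesh call so the sketch runs on any machine.

```python
import subprocess
import sys


def run_control(cmd: list, timeout: int = 60) -> tuple:
    """Run a control command and return (ok, message).

    On failure the message is the command's stderr, so the caller can put
    the real reason in the HTTP response and in the notification body.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False, f"command timed out after {timeout}s"
    except FileNotFoundError:
        return False, f"command not found: {cmd[0]}"
    if proc.returncode == 0:
        return True, proc.stdout.strip()
    # Prefer stderr, fall back to stdout, then to the bare exit code.
    reason = proc.stderr.strip() or proc.stdout.strip() or f"exit code {proc.returncode}"
    return False, reason


# Stand-in for a failing pvesh invocation.
fake_fail = [
    sys.executable, "-c",
    "import sys; sys.stderr.write('no space left on device'); sys.exit(1)",
]
print(run_control(fake_fail))  # (False, 'no space left on device')
```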
Notification pipeline — four correctness fixes
- Known-error classifier no longer misreads kernel messages. The
`proxmox_known_errors` regex `ata.*error` (case-insensitive) was
matching f**ata**l**Error** inside NVIDIA's
`nvidia_uvm: Unknown symbol nvUvmInterfaceReportFatalError` and
labelling it as "ATA communication error with disk — check SATA
cables". The pattern is now anchored: `\bata\d.*\berror\b` requires
a word boundary before `ata` and a digit immediately after (as in
`ata1`, `ata2.00`), so it still catches real ATA logs but no
longer matches embedded substrings. `UNC` is also word-bounded.
- Burst summaries don't double-count. When the first event of a
burst was already sent individually on the fast-alert path, the
burst summary kept reporting the total count, so the operator saw
an individual "1 system problem" followed within seconds by a
"2 system problems" burst. Burst summaries now report the additional
count ("+N more X in window — N+1 total"), and the wording on
every burst template is updated to match.
- Health journal context no longer pastes unrelated lines. The
AI's "📝 Log:" footer in degraded-health notifications was filling
with whatever systemd happened to log within the same 10-minute
window, including the Monitor's own watchdog killing a stuck
subprocess. The grep prefilter now excludes
`proxmenux-monitor.service` so self-logs never leak into the
context of an unrelated event.
- "Resolved" notifications report the severity the user actually
saw. Previously, if an error fired WARNING, silently escalated
to CRITICAL during its 24h same-key cooldown, and then resolved,
the recovery message read "previous severity: CRITICAL", a value
the operator had never seen in any notification. The recovery now
uses the severity that was actually delivered at notification
time, not whatever later value the DB ended up holding.
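The classifier fix in the first item is easy to reproduce with Python's re module. The two patterns and the NVIDIA line are taken from the notes above. The ATA sample line and the variable names are illustrative assumptions.

```python
import re

OLD = re.compile(r"ata.*error", re.IGNORECASE)          # unanchored pattern
NEW = re.compile(r"\bata\d.*\berror\b", re.IGNORECASE)  # anchored pattern

nvidia_line = "nvidia_uvm: Unknown symbol nvUvmInterfaceReportFatalError"
ata_line = "ata1.00: error: { UNC }"  # sample ATA kernel line for illustration

for name, pattern in (("old", OLD), ("new", NEW)):
    print(name, bool(pattern.search(nvidia_line)), bool(pattern.search(ata_line)))
# old True True
# new False True
```

The old pattern finds `ata` inside `Fatal` and then `Error` at the end of the symbol name. The new one needs `ata` to start a word and be followed by a digit, which the NVIDIA line never provides.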
Frontend polish
- Quiet Hours and Daily Digest time inputs redesigned for mobile.
The cramped two-column grid that overflowed on narrow screens is
replaced by inline labels and full-height time pickers.
- 401 cascade recovery on the Login screen. The dedup flag is
cleared on mount and on a successful login, so a brief stale-token
state no longer leaves the user staring at a blank dashboard
forever.
- API error parsing. The frontend `fetchApi` wrapper now
extracts `{error}` / `{message}` from the JSON body before
throwing, so toast messages on VM control failures, scripts,
notifications and similar endpoints show the real backend reason
instead of `API request failed: 500 INTERNAL SERVER ERROR`.
Thanks again to the testers who keep finding the edge cases — the
SIGKILL-leaking-into-the-wrong-event report and the
fatalError-classified-as-ATA report were both reproduced live and
fixed in this beta. Feedback is welcome on the same channels.