github MacRimi/ProxMenux v1.2.1.2-beta

pre-release, 7 hours ago

ProxMenux v1.2.1.2 (Beta)

Second beta of the v1.2.1.x cycle. This release closes a series of
real-world issues surfaced after v1.2.1.1 shipped: a server outage
caused by a half-applied Log2Ram resize, ATA disk errors that escaped
the observation log, notifications that doubled up after a burst, and
known-error classifier matches that misread NVIDIA kernel messages as
SATA cable issues. It also extends the managed-installs registry
to the Coral TPU host driver — the gasket-dkms PCIe path and the
libedgetpu USB runtime are now both auto-detected and tracked for
upstream updates, and the installer gains a full uninstall flow
mirroring the NVIDIA one. The disk I/O severity model is replaced by
a sliding 24h window with proper warning / critical tiers so quiet
days stay quiet but a single hard error still pages immediately. The
Quiet Hours pipeline now buffers suppressed events to SQLite and
flushes them at the end of the window instead of silently dropping
them.


Main changes in v1.2.1.2

Coral TPU host driver — uninstall + update tracking

The Coral TPU installer (gpu_tpu/install_coral.sh) gains the same
two-action UX as the NVIDIA installer: on a host that already has
Coral installed (PCIe gasket-dkms, USB libedgetpu1-std/-max, or both)
the script now shows a menu offering Reinstall / update or
Uninstall. The uninstall path unloads the apex/gasket modules,
removes the DKMS registrations for every gasket version, purges the
gasket-dkms / libedgetpu1-std / libedgetpu1-max packages, cleans
up the udev rules, removes the apex system group when nobody else
is using it, and clears the Google Coral apt repo. It is idempotent:
missing pieces are no-ops, never errors.

In parallel, the Coral driver is now a first-class entry in the
managed-installs registry. The detector enumerates both variants
(PCIe → installed gasket-dkms version or DKMS-registered build,
USB → libedgetpu1-std/-max apt version) and the checker queries
the right upstream for each: feranick/gasket-driver tags on GitHub
for PCIe and apt-cache policy for the USB runtime. When a newer
version is available, the same notification pipeline that already
powers the NVIDIA-driver update message fires a
coral_driver_update_available event — one per variant, so a host
with both M.2 and USB Coral devices gets independent update streams.
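
The per-variant check described above can be sketched roughly as follows. This is an illustration only, not the registry's actual code: the function names, the dictionary layout and the numeric-only version comparison are assumptions (real Debian version strings can carry epochs and suffixes that this ignores). Only the event name comes from the notes above.

```python
from __future__ import annotations

import re
from typing import Optional


def parse_version(text: str) -> tuple:
    """Reduce a version string such as 'v1.0.18' or '1.0-18' to a tuple of ints.

    Simplification (assumption): only the numeric fields are compared.
    """
    return tuple(int(n) for n in re.findall(r"\d+", text))


def coral_update_events(installed: dict[str, Optional[str]],
                        upstream: dict[str, Optional[str]]) -> list[dict]:
    """Return one update event per Coral variant that has a newer upstream version.

    Both arguments map the hypothetical keys 'pcie' and 'usb' to a version
    string, or to None when the variant is absent or the lookup failed.
    """
    events = []
    for variant in ("pcie", "usb"):
        have = installed.get(variant)
        latest = upstream.get(variant)
        if not have or not latest:
            continue  # variant not installed, or upstream unknown: nothing to report
        if parse_version(latest) > parse_version(have):
            events.append({
                "event": "coral_driver_update_available",
                "variant": variant,
                "installed": have,
                "latest": latest,
            })
    return events
```

Because each variant is evaluated on its own, a host with only the USB runtime never sees a PCIe event, and a host with both gets at most one event per variant.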

Disk I/O severity tiers

The disk_io detector no longer treats every dmesg ATA/SCSI error the
same. It now keeps a 24h sliding window of error timestamps per
device and decides severity from the combination of error type,
rate-per-window and the SMART health verdict:

  • silent — 0–10 errors/24h, SMART PASSED, no hard error pattern.
    The observation is recorded for the disk's history but no
    notification fires.
  • WARNING — 11–100 errors/24h on the same device.
  • CRITICAL — 100+ errors/24h, or SMART FAILED, or any hard
    error like Buffer I/O error, UNC, Medium Error,
    Unrecovered read error, or Sense Key Hardware Error. A single
    hard error always pages immediately, regardless of rate.
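
A minimal sketch of how the tiers above could be computed. The class and its names are hypothetical, and the regular expression is only an approximation of the hard-error patterns listed. The notes give 11–100 as WARNING and 100+ as CRITICAL; this sketch assumes exactly 100 errors is still WARNING.

```python
import re
import time
from collections import deque

WINDOW_SECONDS = 24 * 3600

# Approximation of the hard-error patterns named in the notes (assumption).
HARD_ERROR = re.compile(
    r"Buffer I/O error|\bUNC\b|Medium Error|Unrecovered read error"
    r"|Sense Key\s*:?\s*Hardware Error"
)


class DiskErrorWindow:
    """Keeps a 24h sliding window of error timestamps per device."""

    def __init__(self):
        self._hits = {}  # device name -> deque of timestamps

    def record(self, disk, line, smart_passed, now=None):
        """Record one error line and return 'silent', 'warning' or 'critical'."""
        now = time.time() if now is None else now
        hits = self._hits.setdefault(disk, deque())
        hits.append(now)
        # Drop timestamps that have aged out of the window.
        while hits and hits[0] <= now - WINDOW_SECONDS:
            hits.popleft()
        count = len(hits)
        # A hard error or a failed SMART verdict is critical regardless of rate.
        if HARD_ERROR.search(line) or not smart_passed or count > 100:
            return "critical"
        if count >= 11:
            return "warning"
        return "silent"
```

The error is appended to the window before any severity decision is made, so even a "silent" result still contributes to the device's count.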

The 24h same-disk cooldown is keyed by (disk, severity_tier), so a
warning → critical escalation produces a second notification even
within the cooldown window. The pre-existing return-before-record
bug is also fixed: observations are now written to
disk_observations before the SMART gate, so transient errors
that don't yet trip SMART still build the per-disk history.
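
The effect of keying the cooldown by (disk, severity_tier) can be illustrated with a small sketch. The class below is hypothetical and keeps its state in memory. Where the real cooldown state is stored is not described in these notes.

```python
COOLDOWN_SECONDS = 24 * 3600


class TierCooldown:
    """Suppresses repeat notifications per (disk, severity tier) pair."""

    def __init__(self):
        self._last_sent = {}  # (disk, tier) -> timestamp of last notification

    def should_notify(self, disk, tier, now):
        """Return True and start a new cooldown if this pair is not cooling down."""
        key = (disk, tier)
        last = self._last_sent.get(key)
        if last is not None and now - last < COOLDOWN_SECONDS:
            return False  # same disk, same tier, still inside the cooldown
        self._last_sent[key] = now
        return True
```

Since "warning" and "critical" occupy different keys, a disk that escalates an hour after its warning was sent still produces a critical notification, while a second warning in that hour is suppressed.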

Quiet Hours — events are no longer dropped

Previously, non-CRITICAL events fired during a channel's Quiet Hours
window were silently dropped. They are now buffered to SQLite (a new
quiet_pending table mirroring the existing digest_pending) and
released as a single grouped summary at the moment the window
closes. CRITICAL still bypasses Quiet Hours and is delivered live.
The channel's quiet_* and digest_* fields are also now returned
by get_settings() so the toggle state correctly reloads on a page
refresh.
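
A rough sketch of the buffer-then-flush behaviour, using Python's standard sqlite3 module. Only the table name quiet_pending comes from the notes. The column layout, the function names and the summary wording are assumptions made for illustration.

```python
import sqlite3


def open_buffer(path=":memory:"):
    """Open the database and create the pending table (hypothetical schema)."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS quiet_pending (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               channel TEXT NOT NULL,
               severity TEXT NOT NULL,
               title TEXT NOT NULL,
               created_at REAL NOT NULL)"""
    )
    return db


def handle_event(db, channel, severity, title, now, in_quiet_hours):
    """Return 'send' for live delivery, or 'buffered' if the event was stored."""
    if severity == "CRITICAL" or not in_quiet_hours:
        return "send"  # CRITICAL bypasses Quiet Hours
    db.execute(
        "INSERT INTO quiet_pending (channel, severity, title, created_at) "
        "VALUES (?, ?, ?, ?)",
        (channel, severity, title, now),
    )
    db.commit()
    return "buffered"


def flush_channel(db, channel):
    """Build one grouped summary for the channel and clear its buffer."""
    rows = db.execute(
        "SELECT severity, title FROM quiet_pending "
        "WHERE channel = ? ORDER BY created_at, id",
        (channel,),
    ).fetchall()
    if not rows:
        return None
    db.execute("DELETE FROM quiet_pending WHERE channel = ?", (channel,))
    db.commit()
    lines = [f"- [{severity}] {title}" for severity, title in rows]
    return f"{len(rows)} event(s) held during Quiet Hours:\n" + "\n".join(lines)
```

Writing to disk rather than holding events in memory means the buffered events would also survive a restart of the service during the quiet window, assuming a file path is used instead of ":memory:".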

log2ram — apply path actually applies

The post-install auto/update flow used to write the new tmpfs size to
/etc/log2ram.conf but didn't restart the log2ram daemon, so a
configured 512M stayed at the original 128M until the operator
restarted manually. The .1.10 server crashed during a backup because
of exactly this. The flow now reloads systemd, restarts log2ram and
rsyslog, and re-runs log2ram clean / write, so the new size
takes effect on the running tmpfs immediately.
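
The failure mode here is a mismatch between the configured size and the size of the running tmpfs. A check along these lines could detect it. This is a sketch with hypothetical function names, and it assumes the config uses an unquoted SIZE=<number><K|M|G> line.

```python
import re

UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def configured_size_bytes(conf_text):
    """Return the SIZE= value from log2ram.conf text in bytes, or None if absent."""
    match = re.search(r"^\s*SIZE=(\d+)([KMG])?\s*$", conf_text, re.MULTILINE)
    if not match:
        return None
    number, unit = match.groups()
    # No suffix is treated as plain bytes (assumption).
    return int(number) * UNITS.get(unit, 1)


def apply_is_pending(conf_text, mounted_bytes):
    """True when the config asks for a size the running tmpfs does not have."""
    wanted = configured_size_bytes(conf_text)
    return wanted is not None and wanted != mounted_bytes
```

On a live host, mounted_bytes could be taken from os.statvfs on the log2ram mount point (f_frsize * f_blocks). With a 512M config and a 128M mount, apply_is_pending returns True until the service is restarted.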

VM/CT control — real errors instead of bare 500s

Failed pvesh start/stop/restart operations now surface the actual
pvesh stderr (e.g. "no space left on device" or "Configuration file
'nodes/.../qemu-server/100.conf' does not exist") instead of a
bare 500 INTERNAL SERVER ERROR. The backend also fires a
vm_fail / ct_fail notification on every failed control action so
the operator sees it on Telegram even if they were not looking at
the dashboard at that moment.
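
The general pattern of passing a command's stderr through to the API response can be sketched as below. The function name and the response shape are assumptions, not the backend's actual handler.

```python
import subprocess


def run_control(cmd):
    """Run a control command and return (http_status, json_body).

    On failure the command's own stderr becomes the error text, instead of
    a generic message.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode == 0:
        return 200, {"success": True}
    reason = proc.stderr.strip() or f"command exited with status {proc.returncode}"
    return 500, {"success": False, "error": reason}
```

A caller would pass the pvesh command as an argument list. The same reason string could then be reused as the body of the vm_fail / ct_fail notification.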

Notification pipeline — four correctness fixes

  • Known-error classifier no longer misreads kernel messages. The
    proxmox_known_errors regex ata.*error (case-insensitive) was
    matching the substring atalError inside NVIDIA's
    nvidia_uvm: Unknown symbol nvUvmInterfaceReportFatalError and
    labelling it as "ATA communication error with disk — check SATA
    cables". The pattern is now anchored: \bata\d.*\berror\b requires
    a word boundary before ata and a digit immediately after (as in
    ata1, ata2.00), so it still catches real ATA logs but no
    longer matches embedded substrings. UNC is also word-bounded.
  • Burst summaries don't double-count. When the first event of a
    burst was already sent individually on the fast-alert path, the
    burst summary kept reporting the total count, which made the
    operator see "1 system problem" individual + "2 system problems"
    burst within seconds. Burst summaries now report the additional
    count ("+N more X in window — N+1 total"), and the wording on
    every burst template is updated to match.
  • Health journal context no longer pastes unrelated lines. The
    AI's "📝 Log:" footer in degraded-health notifications was filling
    with whatever systemd happened to log within the same 10-minute
    window — including the Monitor's own watchdog killing a stuck
    subprocess. The grep prefilter now excludes
    proxmenux-monitor.service so self-logs never leak into the
    context of an unrelated event.
  • "Resolved" notifications report the severity the user actually
    saw. Previously, if an error fired WARNING, silently escalated
    to CRITICAL during its 24h same-key cooldown, and then resolved,
    the recovery message read "previous severity: CRITICAL" — a value
    the operator had never seen in any notification. The recovery now
    uses the severity that was actually delivered at notification
    time, not whatever later value the DB ended up holding.
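
The classifier fix in the first item can be reproduced with Python's re module. The two patterns and the NVIDIA line are taken from the text above. The sample ATA line is an assumed example of a typical libata message, not a line from a reported log.

```python
import re

# Pattern before the fix: matches "ata" followed by "error" anywhere in the line.
OLD_PATTERN = re.compile(r"ata.*error", re.IGNORECASE)

# Pattern after the fix: "ata" must start a word and be followed by a digit,
# and "error" must be a whole word.
NEW_PATTERN = re.compile(r"\bata\d.*\berror\b", re.IGNORECASE)

NVIDIA_LINE = "nvidia_uvm: Unknown symbol nvUvmInterfaceReportFatalError"
ATA_LINE = "ata2.00: error: { UNC }"  # assumed sample of a real ATA log line


def is_false_positive(pattern):
    """True if the pattern misclassifies the NVIDIA message as an ATA error."""
    return pattern.search(NVIDIA_LINE) is not None
```

With these definitions, is_false_positive(OLD_PATTERN) is True because "atalError" inside "FatalError" satisfies ata.*error, while is_false_positive(NEW_PATTERN) is False and NEW_PATTERN still matches ATA_LINE.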

Frontend polish

  • Quiet Hours and Daily Digest time inputs redesigned for mobile.
    The cramped two-column grid that overflowed on narrow screens is
    replaced by inline labels and full-height time pickers.
  • 401 cascade recovery on the Login screen — the dedup flag is
    cleared on mount and on a successful login so a brief stale-token
    state no longer leaves the user staring at a blank dashboard
    forever.
  • API error parsing — the frontend fetchApi wrapper now
    extracts {error} / {message} from the JSON body before
    throwing, so toast messages on VM control failures, scripts,
    notifications and similar endpoints show the real backend reason
    instead of API request failed: 500 INTERNAL SERVER ERROR.
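
The error-parsing rule in the last item is shown here in Python to stay consistent with the other sketches in these notes; the real wrapper lives in the frontend. The function name is hypothetical, and preferring {error} over {message} when both are present is an assumption.

```python
import json


def error_message(status, status_text, body):
    """Pick the most useful error text for a failed API response.

    Falls back to the generic message when the body is missing, is not
    JSON, or carries neither an 'error' nor a 'message' string.
    """
    fallback = f"API request failed: {status} {status_text}"
    try:
        data = json.loads(body)
    except (TypeError, ValueError):
        return fallback
    if isinstance(data, dict):
        for key in ("error", "message"):
            value = data.get(key)
            if isinstance(value, str) and value.strip():
                return value
    return fallback
```

Keeping the generic text as the fallback means responses from endpoints that do not return a JSON body still produce a toast.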

Thanks again to the testers who keep finding the edge cases — the
SIGKILL-leaking-into-the-wrong-event report and the
fatalError-classified-as-ATA report were both reproduced live and
fixed in this beta. Feedback is welcome on the same channels.

