github PegaProx/project-pegaprox v0.9.12
v0.9.12 — monitoring expansion + perf + security hardening

latest release: v0.9.12.1
7 hours ago

Highlights

80 commits since v0.9.11. The big-ticket items are a critical gevent-pool truthy-bug that made every fanout in PegaProx silent-sequential since the helper was written, PEGAPROX_WORKERS actually getting plumbed through to the WSGI spawn-pool (the env var only renamed a log line before this release), nine new monitoring surfaces on the Node/VM/Insights tabs, and an SSE reconnect path that gets the UI back to green within ~30s instead of sitting on a 15s cooldown.

⚡ Performance + stability

The run_concurrent helper in pegaprox/utils/concurrent.py (and a duplicate copy in core/manager.py) checked if GEVENT_POOL and GEVENT_AVAILABLE: before dispatching. gevent.pool.Pool.__bool__ is len(self) — so on an empty pool the check was False, the helper fell through to the sequential path, and every cluster-fan-out, every PBS scan, every multi-cluster aggregator ran one-call-at-a-time. F1 numbers below are the order-of-magnitude this single is not None cost us in production:

  • /health and /vms-backup-status parallelised — vms-backup-status 10.4s → 0.45s wall (22× on dev). /health storage fanout now filters ONLINE nodes first so a dead node can't drag everything into its 10s timeout.
  • /cluster-health seven SSH probes (corosync rings + pvecm quorum + 5 systemctl checks) run as parallel greenlets capped at 12s — was sequentially up to 56s.
  • PEGAPROX_WORKERS formula changed min(cpu*2, 8)max(8, cpu*4) and actually wired through to spawn=Pool(workers) on WSGIServer. Before this release the env var only changed the startup log; gevent.pywsgi defaulted to unlimited greenlets.
  • SSE reconnect: error-cooldown 15s → 10s, immediate metrics push after a successful reconnect. Cluster drops + comes back → corporate-dashboard tile flips green inside 30s instead of waiting out the cooldown.
  • broadcast_sse now does json.dumps(..., default=str) so a stray datetime / set / bytes from a downstream serialiser doesn't silently kill the whole broadcast loop.
  • Reconnect-log spam backoff: INFO on attempt #1 + every 50th (~8min), DEBUG between. Customers running with debug-logging weren't reading the actual reconnect messages anymore.

📈 Monitoring expansion

Nine surfaces added across Node detail / VM detail / Insights tabs:

  • Disk SMART data viewer modal — replaces the old alert(JSON.stringify(...)) button. Per-attribute table, OK/threshold colour.
  • Per-NIC traffic + error/drop counters parsed out of /proc/net/dev via SSH.
  • Top-N noisy neighbors card on Insights — CPU / RAM / Disk-fill / Disk-IO / Net-IO with metric switcher.
  • Per-Tag and Per-Pool rollups on Insights (groups VMs by tag/pool, sums CPU/RAM/disk).
  • Cluster Health panel on Node detail — corosync rings, pvecm quorum view, 5 service cards (pveproxy, pvedaemon, pve-cluster, corosync, pvestatd). Renders an explainer instead of five empty ? cards when SSH isn't configured.
  • lm-sensors panel (CPU temp / fan RPM / voltages) where the host has lm-sensors installed.
  • Guest-Agent enrichment expander on VM detail: kernel_version, per-NIC IPs, real filesystem fill, logged-in users, clock-skew vs host.
  • Insights PDF export now carries Top Talkers + Rollups blocks.
  • 200+ new i18n strings across DE / EN / FR / ES / PT / KO / IT.

🛠️ New features

  • ACME DNS-01 challenge support with RFC 2136 dynamic update — BIND / PowerDNS / Knot. @gyptazy PR #429, sponsored by credativ.
  • PVE node subscription management — new Subscriptions tab in Datacenter with cluster-wide aggregator + per-node set / refresh / delete. @gyptazy PR #508.
  • Cluster Worldmap — offline geographic view with zoom + country capitals overlay (no Mapbox dependency).
  • First-run setup wizard — replaces the hardcoded pegaprox/admin default. Fresh installs go through the NOT_INITIALIZED flow and pick their own admin credentials.

🔒 Security

  • 25 client error-response sites refactored through the safe_error() helper — no more raw str(e) leaking stack-fragments, internal paths, paramiko or sqlite internals to the browser.
  • OIDC redirect_after open-redirect closure — //attacker.com was passing the old startsWith('/') gate. New _safe_internal_path rejects protocol-relative, CRLF, backslash, control chars. Mirrored client-side.
  • PBS notification create / update mass-assignment — _sanitize_pbs_kwargs whitelist; key matches ^[a-z][a-z0-9-]{0,30}$, scalar values only.
  • migrate_db._row_counts_plain SQL hotfix — v0.9.10.1 had only patched the encrypted variant with _safe_quote_ident; the plain-DB variant still concatenated table names. Aligned.
  • Three defense-in-depth audit passes — RFC-1035 node-name regex on the new monitoring endpoints, payload-limit clamp on /rollups, shlex.quote on the SSH systemctl service name, PVE-boundary regex on returned node / storage names.
  • All third-party Actions pinned to commit SHA + tag comment (CodeAnt #511, #512).
  • Template-injection fix on the release-tag interpolation in release-images.yml (env: indirection).
  • Debian cloud-image SHA512-verified before virt-customize.
  • Alert email payloads HTML-escape user-controlled fields; audit log strips CR / LF / U+2028 / U+2029 from log lines.
  • ACME directory URL routed through sanitize_outbound_url.
  • Plain-JSON config fallback removed from 5 load paths — the encrypted DB is the single source of truth, the .json siblings were only useful during migration anyway.

🐛 Customer-reported fixes

  • #413 — Site Recovery: six layers fixed across the month. broadcast_sse crash on SDN list-shape, missing get_vms shim on the Proxmox manager, target_vmid awareness for xcrepl-backed plans, qmigrate-abort detection, planned/emergency re-runnable from completed/failed status, stale-token retry in create_api_token (Proxmox occasionally returns "Token already exists" on a token that was just deleted — retry with delete-first now).
  • #438 @crcro — ESXi migration: sshfs+drive_mirror live-pivot was not updating the persisted disk reference, so the pivot succeeded on disk but PegaProx still reported the old path.
  • #444 — PVE auth circuit breaker: rotating root password on the PVE side no longer slow-DoSes pveproxy with retry storms.
  • #451 — Login text invisible in Modern UI corporateLight theme (foreground vs background colour from the wrong CSS var).
  • #455 / #456 — LXC replication race + hostname / MAC resolution.
  • #484 — HA maintenance flag preservation across reboot + SSE heartbeat.
  • #508 — Subscription-key masking on the cluster-wide aggregator (key showed up unmasked in the cluster view but was masked correctly per-node).
  • #509 — SQLCipher mlock ENOMEM spam in rootless containers. New PEGAPROX_CIPHER_MEMORY_SECURITY env with an auto default that detects rootless via euid + RLIMIT_MEMLOCK and skips the mlock attempt.
  • #510 — Per-node status log demoted info → debug (was filling up logs/<id>.log with status pings at INFO every poll).

⚙️ Defaults changed

  • PEGAPROX_WORKERS: formula now max(8, cpu_count * 4). Override via env still wins.
  • PEGAPROX_CIPHER_MEMORY_SECURITY: new env (auto / on / off, default auto). At-rest AES-256 is unchanged — this only controls the mlock attempt on the in-memory key buffer.

💎 Platinum Sponsors

Massive thanks 🙌. Sponsor PegaProx → opencollective.com/pegaprox | pegaprox.com/#sponsor


Upgrade: in-app updater, bash update.sh, or docker compose pull && docker compose up -d.
Docker: ghcr.io/pegaprox/pegaprox:v0.9.12 (linux/amd64 + linux/arm64).

Don't miss a new project-pegaprox release

NewReleases is sending notifications on new releases.